General
Monitoring can be provided through an abstract service that offers application audit functionality and integrates the classical understanding of application monitoring (see ITIL Event Handling). This service can be seen as a kind of data warehouse to which external tools are connected.
For any kind of monitoring, there are several topics which should be considered:
- Failure Detection
can be done on the infrastructure and application side
- Performance Degradation Detection
can be done on the infrastructure and application side
- Capacity Planning
can be done for hard resources (CPU, memory, etc.) on the infrastructure side and for soft resources (thread pools, queues, etc.) on the application side
- User Reaction to Business Offerings
can be done on the application side
- Intruder Detection
can be done on the infrastructure and application side
Failure Detection
While total failure of infrastructure or software is easy to monitor, partial failures can have myriad causes. For instance, a partial hardware failure might cause downstream services to fail; the same symptom could also stem from misconfigured software or supporting software. Software failures can therefore be detected in one of three ways:
- An external system provides monitoring functionality for the application
- An agent inside the application system performs monitoring
- The application itself detects problems and reports them
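The third option, where the application detects and reports its own problems, can be sketched as a small set of health checks. This is a minimal illustration only: the function names, the check names and the three status labels are assumptions made for the example, not part of any specific monitoring product.

```python
from typing import Callable, Dict

# Hypothetical type: a check returns True when the component is healthy.
HealthCheck = Callable[[], bool]


def run_health_checks(checks: Dict[str, HealthCheck]) -> Dict[str, str]:
    """Run every check and classify each component as 'ok' or 'failed'.

    A check that raises is treated as failed, so one broken dependency
    does not hide the state of the other components.
    """
    report = {}
    for name, check in checks.items():
        try:
            report[name] = "ok" if check() else "failed"
        except Exception:
            report[name] = "failed"
    return report


def overall_status(report: Dict[str, str]) -> str:
    """Collapse a report into 'up', 'degraded' (partial failure) or 'down'."""
    failed = [name for name, state in report.items() if state != "ok"]
    if not failed:
        return "up"
    return "down" if len(failed) == len(report) else "degraded"
```

An external system or an agent could call the same checks from outside; distinguishing "degraded" from "down" is what makes partial failures visible.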
Performance Degradation Detection
Performance degradation can be identified by comparing current monitoring data with historical data. The following measures can be considered for this kind of monitoring:
- Latency
Latency is a single-user measure that captures the interaction times of services. It can be measured from user request to user response, across service interactions, or within service processing.
- Throughput
Throughput is a system-wide measure, e.g. the number of disk reads per minute or transactions per second. It must be put into context, e.g. by correlating it with how many users were logged into the system.
- Utilization
Utilization relates to resources. Collecting it for an application is important: correlated with applications or activities, this information is key to identifying the root cause of a performance issue.
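Comparing current data with historical data can be sketched as follows. The rule used here (current mean latency above the historical mean plus a number of standard deviations) and the default tolerance of 3 are assumptions chosen for illustration; real systems often use percentiles or seasonal baselines instead. Both function names are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def is_degraded(historic_ms: Sequence[float],
                current_ms: Sequence[float],
                tolerance: float = 3.0) -> bool:
    """Flag degradation when the current mean latency exceeds the
    historical mean by more than `tolerance` historical standard deviations."""
    if len(historic_ms) < 2 or not current_ms:
        raise ValueError("need at least two historic samples and one current sample")
    baseline = mean(historic_ms)
    spread = stdev(historic_ms)
    return mean(current_ms) > baseline + tolerance * spread


def throughput_per_user(transactions: int, seconds: float,
                        logged_in_users: int) -> float:
    """Put throughput into context: transactions per second per logged-in user."""
    if seconds <= 0 or logged_in_users <= 0:
        raise ValueError("seconds and logged_in_users must be positive")
    return transactions / seconds / logged_in_users
```

Normalizing throughput by the number of logged-in users is one way to tell a genuine slowdown apart from a period with simply fewer users.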
Capacity Planning
There are two different kinds of capacity planning:
- Long-Term Planning
Used to match hardware needs, whether real or virtualized, with workload requirements. Plans for future utilization can be derived from current resource utilization.
- Short-Term Planning
Used to match ad hoc needs, such as access to cloud resources or VMs. This planning is based on utilization, which for the most part is also used for billing in cloud environments.
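For long-term planning, deriving future utilization from current utilization can be sketched as a simple linear trend. The sketch assumes equally spaced samples (for example one per month) and a linear growth pattern, which is a simplification; the function name and the threshold parameter are hypothetical.

```python
from typing import Optional, Sequence


def periods_until_threshold(utilization: Sequence[float],
                            threshold: float) -> Optional[float]:
    """Fit a least-squares line to equally spaced utilization samples and
    return how many periods after the last sample the trend reaches
    `threshold`. Returns None when the trend is flat or falling, and 0.0
    when the fitted line has already reached the threshold."""
    n = len(utilization)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(utilization) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(utilization))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = y_mean - slope * x_mean
    crossing = (threshold - intercept) / slope  # x position where the line hits threshold
    return max(0.0, crossing - (n - 1))
```

With monthly CPU utilization of 50, 55, 60 and 65 percent and a threshold of 80 percent, the trend reaches the threshold three months after the last sample.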
User Interaction
Monitoring user interaction is important for supporting business users with their requests and thereby improving the experienced service. It depends on four key areas:
- Latency of user requests
Whenever users have to wait at their screen for a reply, they judge the service provided by that wait, and a service that makes users wait will be used differently.
Google reported that delaying search result pages by 100 ms to 400 ms reduced the number of searches performed. For applications where users within a company cannot choose a different provider, latency is a key source of complaints.
- Reliability of systems
Unavailable services strongly affect the experienced service. Failure detection can help here.
- Effects of business offerings or user interface modifications
The measurements collected from testing (e.g. A/B testing) must be meaningful for the goal of the test. Data must be associated with variant A or B of a system.
- Organization's set of metrics
Organizations have their own metrics to determine the effectiveness of their offerings and supported services. For any provided service, the metrics used to measure user satisfaction must be valid.
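The requirement that test data be associated with variant A or B can be sketched as below. The tuple layout (variant label, measured value) and the function name are assumptions for illustration; the measured value could be a latency, a conversion flag, or any metric that matches the goal of the test.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def mean_by_variant(samples: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group measurements by their variant label and average each group.

    Samples without a valid variant label are rejected, because a
    measurement that cannot be attributed to A or B is useless for the test.
    """
    grouped = defaultdict(list)
    for variant, value in samples:
        if variant not in ("A", "B"):
            raise ValueError(f"sample is not tagged with variant A or B: {variant!r}")
        grouped[variant].append(value)
    return {variant: mean(values) for variant, values in grouped.items()}
```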
User interactions can be monitored with two approaches:
- Real User Monitoring (RUM)
RUM is passive monitoring that records all user interactions with an application. It aims to assess the service level a user actually experienced, so that changes to a service can be correlated with user behavior.
- Synthetic Monitoring
Synthetic monitoring is similar to developers performing stress tests on an application. User behavior is scripted to emulate real use of the system (e.g. via Selenium IDE or test automation tools). This can also be part of automated user acceptance tests.
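A scripted user journey for synthetic monitoring can be sketched without committing to a particular tool. In this sketch each step is a plain callable, which in practice might wrap a browser automation or HTTP call; the function name, the step names and the result fields are assumptions for illustration.

```python
import time
from typing import Callable, Dict, List, Tuple


def run_synthetic_scenario(steps: List[Tuple[str, Callable[[], None]]]) -> List[Dict]:
    """Execute scripted steps in order, recording success and duration.

    The scenario stops at the first failing step, since later steps of a
    user journey usually depend on the earlier ones.
    """
    results = []
    for name, action in steps:
        started = time.perf_counter()
        try:
            action()
            ok = True
        except Exception:
            ok = False
        elapsed_ms = (time.perf_counter() - started) * 1000.0
        results.append({"step": name, "ok": ok, "elapsed_ms": elapsed_ms})
        if not ok:
            break
    return results
```

Run on a schedule, the recorded durations per step can feed the same latency comparison used for real traffic.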
Intrusion Detection
Applications can monitor users and their activities to determine whether those activities are consistent with the user's intended role in the organization. This helps detect attacks and authorization gaps within an application. Options such as location tracking can also help prevent the abuse of user credentials, especially when service providers are involved: imagine an offshore resource using on-site credentials to log into a system, or an intern at a provider gaining source code access. This might violate corporate security guidelines. Network traffic can also be monitored to identify anomalies such as unauthorized users or violations of security policies. This kind of monitoring can also be used to scan for traffic over regulated ports to external parties (e.g. outsourced development).
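Checking activities against the intended role and an expected location can be sketched as follows. The roles, actions, locations and both policy tables are hypothetical examples invented for this sketch; a real deployment would load them from the organization's security policy.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Activity:
    user: str
    role: str
    action: str
    location: str


# Hypothetical policy tables: what each role may do, and from where.
ALLOWED_ACTIONS: Dict[str, Set[str]] = {
    "developer": {"read_source", "commit"},
    "intern": {"read_docs"},
}
ALLOWED_LOCATIONS: Dict[str, Set[str]] = {
    "developer": {"on-site"},
    "intern": {"on-site"},
}


def find_violations(activity: Activity) -> List[str]:
    """Return the policy violations for one recorded activity.

    Unknown roles have no allowed actions or locations, so everything
    they do is reported.
    """
    violations = []
    if activity.action not in ALLOWED_ACTIONS.get(activity.role, set()):
        violations.append("action-not-permitted-for-role")
    if activity.location not in ALLOWED_LOCATIONS.get(activity.role, set()):
        violations.append("unexpected-location")
    return violations
```

In the two scenarios above, an intern reading source code would be reported as an action outside the role, and on-site developer credentials used from an offshore location would be reported as an unexpected location.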