Production IT Systems - Monitoring and Alerting

Introduction

Monitoring gives insight into an application: how it is working, whether there is a problem, and where the problem is. When the people operating an application know the answers to these questions, they can run the application successfully, help their customers, and keep the business going.

Application monitoring is a design paradigm that begins at the start of the design/ architecture phase. A well-designed application is equipped with self-monitoring capabilities and with placeholders for other monitors to plug into it. The goal of application monitoring is to reduce MTTK (mean time to know) and thus MTTR (mean time to recover). Depending on the case, monitoring can also extend to predictive fault assessment, thereby enabling preventive measures.

So the big three questions monitoring answers are:
  1.     How is the application working?
  2.     Is there a problem in any component of the application?
  3.     If there is a problem, which component is affected?

  

Audience of Monitoring

Monitoring goals differ by audience. A support person is interested in CPU spikes, whereas a business executive will be oblivious to that information. Information that is crucial for one person will be of no consequence to another. This makes it critical to know what information can be drawn from monitoring a particular asset, and who needs what.

  1.  A business executive is more interested in what the impact of a problem is. Business executives might not care that a CPU spike or memory depletion caused downtime on a particular server. They are more likely to want to know how many customers are impacted, or which geography is affected by the downtime of that particular IT resource (an IT resource can be software, hardware, or both).
  2.  A system administrator (server support person) will not be interested in what business is impacted. To a system admin, the information that a particular process used a lot of memory or a particular file consumed a lot of disk space is far more crucial.
  
These examples show that information consumption varies across the enterprise. Information can also be classified into two broad categories.
  •     Information drawn directly from data (e.g. CPU usage hits 90%)
  •     Information or knowledge derived from data (server x is down, so customer impact is y%)
Derived information may not be part of application monitoring itself, but the data collected by monitoring tools is an input to it. For example, to the application support person collecting response times from a particular component, the numbers may indicate a software or hardware problem; to a business executive, the same information means customer-experience degradation.
The reporting or presentation of an event determines what information is derived from the captured data.

Information Technology Resource Stack

Monitoring needs to be a holistic approach, because points of failure are multiple and redundant. The illustration below shows the avenues for monitoring in a typical IT stack.

Figure 1: Monitoring scope in the IT stack

One should draw proper abstractions so that the right tool/ technology can be used at the right place. Layers 4 (application runtime) through 7 (business process) can be grouped under the broad concept of the application stack, and layers 1 (data centers) through 3 (operating system) under the broad concept of the platform.


KPIs to be monitored in the application stack

There is a key set of metrics that should be collected wherever possible. The tuple is
{(response time)/interval, (invocations)/interval, (parallel executions)/interval, (failures)/interval}

{(response time)/ interval}

Response time per interval is a metric that averages the response times observed in a given interval. The length of the interval depends on the requirement: usually, the higher the transactional volume, the shorter the interval, and vice versa. The mathematical function is defined as

  avg response time per interval = (rt_1 + rt_2 + … + rt_n) / n

Where: n = number of invocations completed in the interval, and rt_i = response time of the i-th completed invocation.
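As a minimal sketch (the record shape and function name are illustrative, not from any particular tool), the metric can be computed from a list of (start, end) timestamp pairs:

```python
from statistics import mean

def avg_response_time(records, interval_start, interval_end):
    """Average response time of invocations that completed in
    [interval_start, interval_end). Each record is a hypothetical
    (start_ts, end_ts) pair; response time = end - start."""
    times = [end - start for start, end in records
             if interval_start <= end < interval_end]
    return mean(times) if times else 0.0

# Three executions completing inside a 60-second interval
records = [(0.0, 1.5), (10.0, 10.5), (30.0, 31.0)]
print(avg_response_time(records, 0, 60))  # (1.5 + 0.5 + 1.0) / 3 = 1.0
```

The same record list can later feed the invocations-per-interval count, so one capture serves several tuple members.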

{(Invocations)/ interval}

Invocations per interval is the number of executions in a given interval. An execution is counted in the interval in which it starts; its end time may lie outside that interval. All executions started in the interval must be counted, irrespective of whether they have completed.

{(parallel executions)/ interval}

In a multithreaded environment, parallel executions will happen. Since parallelism is capped by an explicit or implicit upper limit, it is worth measuring this metric and putting thresholds around it from a performance perspective. Measuring this value requires counting the active threads executing the same method/ routine/ function in a given interval.
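A simple way to count active threads in the same routine is a shared gauge that tracks current and peak concurrency. The sketch below is illustrative (class and routine names are invented), not a production profiler:

```python
import threading

class ConcurrencyGauge:
    """Tracks the current and peak number of threads inside a routine."""
    def __init__(self):
        self._lock = threading.Lock()
        self.current = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.current += 1
            self.peak = max(self.peak, self.current)
        return self

    def __exit__(self, *exc):
        with self._lock:
            self.current -= 1

gauge = ConcurrencyGauge()

def business_routine():
    with gauge:                        # every thread entering the routine is counted
        threading.Event().wait(0.05)   # simulate some work

threads = [threading.Thread(target=business_routine) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(gauge.peak)  # up to 8, depending on scheduling
```

Sampling `gauge.current` on a timer gives the parallel-executions-per-interval figure; `gauge.peak` shows how close the routine gets to its concurrency cap.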

{(Failures)/ interval}

This metric counts the number of executions that failed with errors. While counting these executions, one should remember that errors come in two types.
  1. Soft errors: in most programming/ scripting languages, errors are handled with try-catch. An exception that is handled is considered a soft error. These errors do not contribute to the failure count, but they can still be monitored.
  2. Hard errors: errors that are not handled, or are explicitly thrown out of the execution flow, are counted as failures.
The hard-error count in an interval constitutes failures per interval.
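The soft/hard distinction above can be sketched with a wrapper that counts handled exceptions separately from unhandled ones (the counter names and the choice of ValueError as the "handled" class are illustrative):

```python
counters = {"invocations": 0, "soft_errors": 0, "failures": 0}

def monitored_call(fn, *args):
    """Counts a handled exception as a soft error and an
    unhandled one as a hard failure, then re-raises it."""
    counters["invocations"] += 1
    try:
        return fn(*args)
    except ValueError:                  # a class of errors we know how to handle
        counters["soft_errors"] += 1    # soft error: handled, not a failure
        return None
    except Exception:
        counters["failures"] += 1       # hard error: counts toward failures/interval
        raise

monitored_call(int, "42")
monitored_call(int, "not a number")     # ValueError -> soft error
try:
    monitored_call(lambda: 1 / 0)       # ZeroDivisionError -> hard failure
except ZeroDivisionError:
    pass
print(counters)  # {'invocations': 3, 'soft_errors': 1, 'failures': 1}
```

Dividing `counters["failures"]` by the interval length yields failures per interval.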

Operating Systems

  1. OS type & version (Linux/ Unix/ Windows etc., with version and release number)
  2. CPU utilization (utilization aggregate and per core)
  3. Memory utilization
  4. Number of open sockets (network)
  5. Number of open file handles.
  6. Number of running processes
  7. Load average (available on Linux/ Unix operating systems)
  8. Memory consumption of key processes along with process name and details.

Application Runtime Environment

  1. Version of runtime
  2. Memory utilization inside the virtual machine, in case virtual-runtime technology is used (JVM & CLR)
  3. Memory allocation to the application process from operating system.
  4. CPU utilization (of the process spawned by the runtime)
  5. Thread model in case runtime supports multi-threading.
  6. Location of binaries of runtime on file system.
  7. Process ID (to check whether the process is alive or dead)
Java/ .Net virtual machines (JVM – Java Virtual Machine; CLR – Common Language Runtime) are allocated minimum and maximum memory space in which applications run. Mostly this is a static memory space (dynamic incremental allocation also exists, but it only adds memory to the available pool up to an upper limit). This memory space must be explicitly monitored, because even if the operating system has plenty of free memory, the virtual machine can run out of space, causing slowness or even an application crash.
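For instance (the heap sizes and jar name here are illustrative, not recommendations), a JVM is typically started with an explicit memory ceiling, and that ceiling is what heap-utilization monitoring should be compared against:

```shell
# Hypothetical example: fix the heap between 512 MB and 2 GB and log GC activity,
# so virtual-machine memory utilization can be monitored against the -Xmx ceiling
# even while the operating system still reports plenty of free memory.
java -Xms512m -Xmx2g -verbose:gc -jar app.jar
```

An alert on heap usage approaching `-Xmx` catches the "OS has memory but the JVM does not" failure mode described above.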

COTS Products

  1. Log files of the product (middleware/ database logs) for errors.
  2. Log files for logging activity, to see whether the process is working in case explicit monitoring of the process is not available. This helps figure out whether a process is running or is in a hung state.
  3. Port pings wherever required via remote monitors.
  4. Location of binaries of software on file system.
  5. If databases are used and they have their own monitoring support, it should be utilized. Oracle Database comes with OEM (Oracle Enterprise Manager), and likewise many other databases come with their own monitoring tools. Database monitoring has its own separate and unique KPIs (key performance indicators); it is not fully covered in this document.
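The "port ping" in point 3 amounts to attempting a TCP connection from a remote monitor. A minimal sketch (host and port below are placeholders):

```python
import socket

def port_is_open(host, port, timeout=2.0):
    """Remote 'port ping': True if a TCP connection to host:port succeeds
    within the timeout, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. probe a hypothetical listener on the local machine
print(port_is_open("127.0.0.1", 22))  # True only if something is listening on port 22
```

Run on a schedule from a monitoring host, a False result (or a string of them, per the frequency rules discussed later) becomes an availability alarm.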

Applications

  1. Log files for errors and events. Events are to be monitored for business relevance, e.g. a security violation printed in the log file needs to be alerted on.
  2. Process ID, in case the application is a script running standalone or on a schedule. If the process executes without exposing a user interface or a port for status checks, it becomes hard to know whether it is running on a given server. The process ID becomes crucial information for testing the process. If the process ID is not recorded, support has to struggle with the process name, or try a reverse lookup using file or socket handles.
  3. Measuring the tuple for application endpoints (web, database, RMI, file, queue, sockets, mail).
  4. Most of the time a database is an integral part of the application. Database monitoring should still be separate, but the application's database endpoints should measure SQL executions against the tuple described above. In addition, the SQL text should be captured, so that support personnel know what SQL was executed without the help of application developers.
  5. Measuring the tuple for key subroutines (key subroutines must be documented and flagged clearly at design and development time). Key subroutines primarily include business methods that make external/ internal calls to a database, file, or socket. We are already measuring application endpoints, but it is hard to know which business flow a particular endpoint belongs to; measuring a named, known subroutine gives identity to implicit endpoint executions.
  6. If an application is not planning to deploy tools to monitor the tuple, it should record this information in separate log files (e.g. performance.log). This information comes in handy when issues occur, and the log file can be fed into analytical software (spreadsheets, databases, dashboarding) for detailed analysis.
  7. All the endpoints discussed in points 3 & 5 must log the routine/ method name, start timestamp, end timestamp, and key data (if possible) in a delimited format.
  8. The logging should be done explicitly in application code if the technology/ language does not support runtime or compile-time code modification (Aspect-Oriented Programming (AOP) enables runtime and compile-time weaving in Java & .Net). With AOP one can write monitoring code outside the application and plug it into the application at either compile or run time. This avoids code pollution and maintains the readability of the application logic.
  9. If using a compliant technology, the logging above should be achieved implicitly by keeping the monitoring code outside the application logic.
  10. A database can also be used to record the tuples; using NoSQL databases like MongoDB is also good practice. Again, this applies mainly when one has to monitor critical business flows without any off-the-shelf monitoring products.
  11. If there is any key data that can serve as a correlation ID between different routines or applications, it should be logged wherever available, along with the other details.
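In languages with decorators, the delimited performance.log record and the keep-monitoring-out-of-business-logic idea can be combined in a few lines. This is a hedged sketch (file name follows the performance.log suggestion above; the business method is invented), using a Python decorator as the closest analogue of AOP advice:

```python
import functools
import logging
import time
import uuid

# performance.log sink, as suggested above; logger name is illustrative
perf = logging.getLogger("performance")
perf.addHandler(logging.FileHandler("performance.log"))
perf.setLevel(logging.INFO)

def measured(fn):
    """Decorator (the Python analogue of AOP advice): logs routine name,
    start/end timestamps, and a correlation id in pipe-delimited form,
    keeping monitoring code out of the business logic."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        corr_id = kwargs.pop("corr_id", None) or uuid.uuid4().hex
        start = time.time()
        try:
            return fn(*args, **kwargs)
        finally:
            # routine|start_ts|end_ts|correlation_id
            perf.info("%s|%f|%f|%s", fn.__name__, start, time.time(), corr_id)
    return wrapper

@measured
def key_business_method(order_id):
    return order_id * 2   # stand-in for a database/file/socket call

key_business_method(21)
```

Each call appends one delimited line to performance.log, ready for the spreadsheet/ database analysis described in point 6, and the correlation id supports the cross-routine stitching described in point 11.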

Monitoring overheads & considerations

Nothing comes for free. If you apply monitoring, you have to bear costs on the following fronts.
    Monitoring tool software cost (license/ development cost)
    Implementation and sustenance cost
    Application response-time overhead, in case an agent-based monitoring solution is used
From the response-time and application-overhead perspective, monitoring solutions can be divided into two categories.
    Agent-based solutions
    Agentless solutions

Agent-based solutions

Agent-based solutions (also known as intrusive monitoring) deploy an on-site agent on top of the operating system, or even inside the application memory space. Both approaches use the resources of the host server/ application to perform their activities. The implementer should take into account the disk, memory, and CPU usage of these agents. People often implement aggressive monitoring strategies using agent-based techniques, causing a memory or CPU crunch in burst cycles. This resource crunch can cause the host to stall, or even crash at times.
If one decides to use an in-memory application agent such as Introscope, one should validate that the information-dispersion/ logging thread is separate from the application flow thread. One should also validate the message-retention strategy and capacity, in case agents communicate information back to their master. COTS products are very reliable in this regard, but people may try new products or even write their own tools; in that case this aspect deserves serious consideration. In-memory agents add overhead to response time, as they use the application transaction thread to read and record data, consuming CPU cycles from that same thread. Aggressive monitoring that requires many CPU cycles or a lot of memory and executes complex logic must be avoided in production environments.

Agent-Less solutions

Agent-less monitoring (also known as passive monitoring) is a technique where one does not deploy an agent into the application infrastructure/ memory space. This method relies on remote logins/ remote procedure calls/ API (application programming interface) calls. Monitoring on this front is mostly concentrated on operating-system parameters, as operating systems allow remote logins and procedure calls, but the concept has also been extended in the following methods.

End-user monitoring by network packet capture

Traffic tapping by employing a hardware device in front of a switch or router. The copied traffic can then be fed to analytical engines, producing response-time metrics for the server, the network, and even user behavior. This is a costly solution and should be employed for high-value applications that have an end-user interface. Examples of this kind of monitoring are RUM (real user monitoring) and Tealeaf.

Robotic monitoring

This technique uses robotic login attempts/ pings from remote geographies to verify not only response time but also availability over the network. HP Topaz and JMeter are example products of this technique.

Monitoring tools by function

The following table divides the monitoring requirements at the different layers at a finer granularity. The monitoring products/ open-source technologies are only examples and are subject to change in terms of cost and technology.
Log-scrubbing tools can be a product or a simple custom script/ application.

Monitoring scope and example tools (the mentioned tools are not endorsed or recommended, they are just examples)

Alarm Management

Alarms are the output of monitoring systems. A lot of monitoring is put in place, from the infrastructure up to the application's business, which means a lot of alarms are generated. Every alarm, if not managed properly, converts into a unit of work. So if one has good monitoring to keep up the business, one needs an army of people to tackle the work generated by alarms. This defeats the goal of monitoring itself: instead of saving money by investing in monitoring to ensure uptime, one starts to lose money by generating tons of alerts.
Continuous housekeeping is necessary to reduce the volume of alarms. With the passage of time, volumes, complexities, and dependencies grow. It is better to audit and tune alarms quarterly, semi-annually, or annually than to wait for the alarm entropy to grow to a critical level. Below is the list of items that need to be tuned as part of an audit.
  • Tuning warning & critical thresholds for alerting.
  • Turning off alerts that have no business value. If people are not acknowledging an alert they receive over a defined period, the alarm must become a candidate for removal/ tuning.
  • Reducing redundant alarms from the same monitor. Redundancy should exist across different monitors, not within the same monitor.
  • Instead of sending alarms to individual users, distribution lists must be used. The lists must be configured using the enterprise mail system, or any other system used at the enterprise level.
  • Using additional tools/ technologies that help control frequency attributes (e.g. alert only if a problem happens 3 or more times in 5 minutes). Often the monitoring tool employed provides this functionality out of the box.
  • Alarms may be linked to the enterprise change-management system so that selected alarms from particular IT resources can be auto-snoozed.
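The frequency-attribute rule (alert only on 3 or more occurrences in 5 minutes) can be sketched with a sliding window per alarm id. This is an illustrative toy, not a replacement for the out-of-the-box feature many tools provide:

```python
import time
from collections import defaultdict, deque

class FrequencyFilter:
    """Suppresses an alarm unless it fires `threshold` times
    within the last `window` seconds."""
    def __init__(self, threshold=3, window=300):
        self.threshold = threshold
        self.window = window
        self.events = defaultdict(deque)   # alarm_id -> timestamps

    def should_alert(self, alarm_id, now=None):
        now = time.time() if now is None else now
        q = self.events[alarm_id]
        q.append(now)
        while q and q[0] < now - self.window:  # drop events outside the window
            q.popleft()
        return len(q) >= self.threshold

f = FrequencyFilter(threshold=3, window=300)
print(f.should_alert("cpu-high", now=0))    # False (1st occurrence)
print(f.should_alert("cpu-high", now=60))   # False (2nd)
print(f.should_alert("cpu-high", now=120))  # True  (3rd within 5 minutes)
```

Tuning `threshold` and `window` per alarm is exactly the kind of housekeeping the audit list above calls for.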
In a mid-size to large enterprise, the number of IT resources (hardware, software) becomes large. To optimize the operational expenditure of all these resources, teams are created for the layers segmented in Figure 1. The illustration below shows the management structure of alerts and people, by role and type.


Event responsibility distribution

In terms of volume, the infrastructure team supports the largest number of resources; the database team may support a large number of company-wide databases; application teams support relatively fewer resources. So the alarm volumes rank as
Infrastructure alarm volumes > Database alarm volumes > Application alarm volumes
Since application teams are closest to the business, they need to be alerted from all layers. Consolidation and management of notifications are mostly done at the enterprise level to control cost and maintain segregation of duties. Multiple sets of tools/ software are used to achieve the desired state. This strategy is beneficial not only for application and infrastructure support, but is also a big enabler for capturing crucial information for business drivers, impact studies, and real-time visualization capabilities. Data correlation at this stage provides bottom-up and top-down drill-down capability as a value-add.
In general, the following methodology is followed in many organizations for service management (part of the ITIL framework).
Alerting, ticketing & automation

Automated Tickets

Automated tickets are created to fix responsibility for issue resolution and to keep an audit trail from inception to permanent closure. Tickets should be created with much more information than the incident itself contains. This requires the alert information to be corroborated with other sources, such as asset information, support-staff information, and detailed application information. The additional information cuts the turnaround time for the person assigned to the ticket.

Notifications

Automated alerts can be sent to users in various message formats such as email and SMS. This information must be corroborated with additional information from asset, staff, and application information systems. The message should also contain information about possible actions and resolutions, either in the message body or as an external link.

Message format

Automation

Many alarms have fixed, repeatable resolution steps. These alarms should be linked to an enterprise orchestration tool with a DSS (decision support system, mostly rule-based intelligence). With this setup, the alarm is treated as a trigger that fires a resolution routine via the orchestration system, auto-correcting the problem. The auto-correction event should itself be treated as an alarm and sent to the team to leave an audit trail. The audit log can be maintained in emails or in an auto-ticket system (the latter is preferred). The audit log should clearly record the following attributes.
  •     Timestamp of the problem alarm.
  •     Identifier of the auto-correction rule that was applied.
  •     Auto-correction execution timestamp.
  •     Auto-correction output. (Auto-correction is mostly done via scripts; the scripts should be written not only to report success or failure, but also to generate an execution log.)
  •     Auto-correction completion timestamp.

Conclusion

Monitoring is an essential paradigm of application design. The monitoring roadmap should be defined at the very conception of the application itself. People often take monitoring for granted, treating it not as part of the application but as a job for external tools. These misconceptions create higher maintenance and sustainment costs down the line. Most applications share a largely common set of monitoring aspects plus some specialized monitoring requirements (mostly business-related); the roadmap should identify the specialized requirements in advance so that a proper monitoring product/ strategy can be devised or developed.
