Rules for SysAdmins - Keep Systems and Monitoring Separate

There are a number of good practices to follow when setting up and maintaining systems. Some of these rules are obvious and some not so obvious. One of these rules is to make sure your monitoring tools have no dependency on the thing being monitored. While this may seem obvious, lots of organizations still make this mistake. Sometimes it's not even clear to them that they've created such a dependency. Microsoft ran into this very issue last week during an Azure outage. As Mary Jo Foley reported in her All About Microsoft column, the dashboards Microsoft used to alert customers about the outage sat behind the same authentication system that was having issues. If a company with the size and resources of Microsoft can miss this, then so can the average operations, DevOps, and SRE teams supporting applications.

Whenever possible, the system and the monitoring tool for the system should not share a common single point of failure. Here's another example to help illustrate this point. Imagine a system that uses a dashboard for customer reports. Using those same dashboard servers to publish outage notifications and system-problem alerts would be a mistake: if the dashboard infrastructure fails, customers lose the reports and the notifications about the failure at the same time. It means more work, but monitoring systems need to be as isolated as possible from the thing being monitored in order to maintain their integrity.
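One simple way to apply this principle is an external probe that runs on infrastructure sharing nothing (network, auth, hosts) with the system it watches. The sketch below is illustrative, not from the article; the function name and URL are placeholders.

```python
import urllib.request
import urllib.error

def check_endpoint(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with an HTTP 2xx/3xx status.

    Intended to run from a host that shares no single point of
    failure with the system being probed.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP
        # error status all count as the endpoint being unhealthy.
        return False
```

A probe like this only tells the truth if the machine it runs on stays up when the monitored system goes down, which is the whole point of the separation.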

There are numerous ways to achieve such a separation. The first is to consider a cloud-based Software as a Service (SaaS) offering. In some ways, this is like outsourcing monitoring outside your organization. Doing so eliminates any reliance on your own gear and removes the temptation to reuse existing gear in a way that creates a shared single point of failure. It's very common in some environments to maximize the usefulness of on-hand equipment, so it may be tempting to create a virtual machine for a monitoring tool inside the very cluster that tool is monitoring. Using an external service takes those considerations off the table.

If you are in the cloud, isolation can happen at a few different levels. Some organizations create a separate account dedicated to the monitoring tools. That monitoring account is granted just enough permission to connect to and report on the status of resources owned by the other accounts. This is similar to using a SaaS offering, except that it's managed in-house. On AWS, the management application can also be housed in its own VPC, separate from the workloads it watches. Azure and Google Cloud Platform (GCP) have similar options for grouping and isolating resources.
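As a rough sketch of the cross-account arrangement on AWS, the monitored account can expose a read-only role that the monitoring account assumes via STS. The trust policy below is illustrative only; the account ID is a placeholder, and a real setup would pair this with a permissions policy (for example, AWS's managed CloudWatchReadOnlyAccess) attached to the role.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowMonitoringAccountToAssumeThisRole",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111111111111:root" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

Because the role grants only read access, a compromise or outage in the monitoring account cannot modify the monitored resources, and the monitored account's own failures do not take the monitoring tooling down with them.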

An effective system design should account for and attempt to mitigate as many potential issues as possible within reason and budget. Monitoring seems simple, but it is easy to overlook the blind spots it can create. Co-mingling the system being monitored with the monitoring system is one such mistake to avoid.

References

Foley, M. (2020, October 01). Microsoft's Azure AD authentication outage: What went wrong. Retrieved October 06, 2020, from https://www.zdnet.com/article/microsofts-azure-ad-authentication-outage-what-went-wrong/

Foley, M. (2020, August 19). Microsoft has a plan to try to improve Azure outage assistance. Retrieved October 06, 2020, from https://www.zdnet.com/article/microsoft-has-a-plan-to-try-to-improve-azure-outage-assistance/