This article was originally published on Express Computer, authored by Bert Hooyman, Chief Architect, Mphasis
First, the basics. Observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs (Wikipedia). The original concept was introduced in 1960, in the context of linear dynamic systems. We won’t go into the formal proofs for observability of linear time-varying systems here.
As observability is based on system outputs, we don’t need to interfere or interact with a system to properly observe it. We only have to consider the outputs of the system. Examples of output include logs, traces, and metrics.
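To make the three kinds of output concrete, here is a minimal sketch in Python that emits a log line, a metric, and a trace span for one unit of work. It uses only the standard library; the in-memory sinks and the names `handle_request`, `orders_processed_total`, and `order_processed` are illustrative assumptions, not part of any particular observability product.

```python
import json
import time
import uuid
from collections import Counter

# Hypothetical in-memory sinks; a real system would ship these to a backend.
metrics = Counter()   # metric: numeric values aggregated over time
spans = []            # trace: timed units of work that share a trace id
log_lines = []        # log: discrete events, here serialized as JSON


def handle_request(order_id, trace_id=None):
    """Process a (pretend) order and emit a metric, a span, and a log line."""
    trace_id = trace_id or uuid.uuid4().hex
    start = time.perf_counter()

    # ... the actual business logic would run here ...
    metrics["orders_processed_total"] += 1

    duration = time.perf_counter() - start
    spans.append({"trace_id": trace_id, "name": "handle_request",
                  "duration_s": duration})
    log_lines.append(json.dumps({"level": "INFO", "event": "order_processed",
                                 "order_id": order_id, "trace_id": trace_id}))
    return trace_id
```

Note that the caller only reads what the function emits; nothing here requires stopping or probing the running code, and the shared `trace_id` is what lets the log line and the span be correlated later.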
Monitoring vs Observability
How does observability differ from system monitoring? Aren't these things the same? Almost, but not quite. Observability is a property of a system, whereas monitoring is an activity: the active collection of data about that system. Monitoring helps assess the health of a system and how its components work together.
Monitoring will get you information about your system and let you know when there’s a failure, while observability grants an easy way of understanding where and why that failure happened, and what caused it (Anna Geller, 2021). Continuous monitoring makes a system observable in practice only if its logs, traces, and metrics provide sufficient information.
Observing a Distributed System
Today, we find ourselves re-architecting monoliths into capability-based distributed systems. Needless to say, distributed systems are always complex. As Cindy Sridharan states in her book:
• No complex system is ever fully healthy.
• Distributed systems are pathologically unpredictable.
• It’s impossible to predict the myriad states of partial failure various parts of the system might end up in.
• Failure needs to be embraced at every phase, from system design to implementation, testing, deployment, and, finally, operation.
• Ease of debugging is a cornerstone for the maintenance and evolution of robust systems.
Clearly, observability is a highly desirable property for complex distributed systems, including microservices-based decompositions of traditional monoliths. As modern system architectures emphasize the role of capability-based decomposition, the importance of observability in a solution architecture rises. Remember, no complex system is ever fully healthy. Thank you, Cindy.
Observing a system deployed on the public cloud
Observability is not just meant for managing the health of a system; it is also useful for measuring resource consumption. In a public cloud deployment, observing component resource consumption serves two purposes: dynamic resource allocation (i.e. scale-out and scale-in) and cloud financial management, also known as FinOps.
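As a rough illustration of both uses, the sketch below turns an observed CPU utilization into a scaling decision and a cost figure. The proportional rule, the 60% target, the replica limits, and the price are all assumptions chosen for the example; real autoscalers and cloud billing models are more involved.

```python
import math


def desired_replicas(current, observed_cpu_pct, target_cpu_pct=60,
                     min_replicas=1, max_replicas=10):
    """Scale the replica count in proportion to observed vs. target load."""
    raw = math.ceil(current * observed_cpu_pct / target_cpu_pct)
    # Clamp so that a metric spike or an idle period cannot push the
    # deployment outside the agreed limits.
    return max(min_replicas, min(max_replicas, raw))


def hourly_cost(replicas, price_per_replica_hour):
    """FinOps view of the same decision: what the new size costs per hour."""
    return replicas * price_per_replica_hour
```

With four replicas running at 90% CPU against a 60% target, `desired_replicas(4, 90)` returns 6 (scale-out); at 30% CPU, `desired_replicas(4, 30)` returns 2 (scale-in). Feeding the result into `hourly_cost` ties the technical observation to the financial one.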
Observability as a cross-cutting concern
In a distributed system, observability as a system property turns into a cross-cutting concern across all components of that system. In software engineering, we may apply aspect-oriented programming techniques to support observability (Wikipedia).
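Python has no built-in aspect weaver, but a decorator gives much the same effect: the observability concern is written once and applied to any function without touching its body. The following is a minimal sketch under that assumption; `observed`, `call_stats`, and `reserve_stock` are hypothetical names introduced for illustration.

```python
import functools
import time
from collections import defaultdict

# Per-function statistics gathered by the aspect (an in-memory stand-in
# for a real metrics backend).
call_stats = defaultdict(lambda: {"calls": 0, "errors": 0, "total_s": 0.0})


def observed(func):
    """Wrap func so every call is counted and timed, including failures."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        stats = call_stats[func.__name__]
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            stats["errors"] += 1
            raise  # observing must not change the function's behavior
        finally:
            stats["calls"] += 1
            stats["total_s"] += time.perf_counter() - start
    return wrapper


@observed
def reserve_stock(sku, quantity):
    # Business logic only; no logging or timing code in here.
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    return f"{quantity} x {sku} reserved"
```

After a few calls, `call_stats["reserve_stock"]` holds the call count, error count, and cumulative duration, while `reserve_stock` itself stays free of instrumentation code.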
Service meshes use network proxies paired with each service in an application and a set of task management processes (Wikipedia). These proxies can take care of many of the cross-cutting concerns including observability, security, and failure handling. With proxies in place, software engineers don’t need to concern themselves with such features; the mesh magically ‘takes care of it’.
The disadvantage of a service mesh is that the introduction of proxies effectively doubles the number of 'moving parts' of a system. Alternatively, a microservice chassis framework, promoted through a service template, is an effective way to introduce tracing, health checks, and metrics in a coordinated manner. Through service templates, observability is integrated with the microservice component runtime.
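The chassis idea can be sketched as a base class that a service template would hand to every new service. This is an assumption-laden illustration, not the API of any real chassis framework: `ServiceChassis`, `PricingService`, and the metric names are made up for the example.

```python
from collections import Counter


class ServiceChassis:
    """Hypothetical base class supplied by a service template."""

    name = "unnamed-service"

    def __init__(self):
        self.metrics = Counter()
        self._health_checks = {}

    def add_health_check(self, check_name, check):
        """Register a zero-argument callable that returns True when healthy."""
        self._health_checks[check_name] = check

    def health(self):
        """Run all checks; a check that raises counts as failed."""
        results = {}
        for check_name, check in self._health_checks.items():
            try:
                results[check_name] = bool(check())
            except Exception:
                results[check_name] = False
        status = "UP" if all(results.values()) else "DOWN"
        return {"service": self.name, "status": status, "checks": results}

    def handle(self, request):
        """Entry point: counts requests and failures around process()."""
        self.metrics["requests_total"] += 1
        try:
            return self.process(request)
        except Exception:
            self.metrics["requests_failed_total"] += 1
            raise

    def process(self, request):
        raise NotImplementedError("concrete services implement this")


class PricingService(ServiceChassis):
    name = "pricing"

    def process(self, request):
        # Only business logic lives in the concrete service.
        return {"sku": request["sku"], "price": 100}
```

The design choice is the trade-off described above: no extra proxy process per service, at the cost of every service having to be built on (and upgraded with) the shared chassis.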
Control Tower: where ends meet
In a distributed system, observable components expose their internal state, providing a way of understanding where and why failures happened, and what caused them. But this insight is available only at a deeply technical level. A component is unaware of the business context it is part of; it cannot tell us that a revenue-critical process flow is disrupted because of its behavior. For this process-level insight, we need to look elsewhere. Platform Control is where component behaviors and information flows are interpreted in the business process context.
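One simple way to picture this interpretation step is a lookup from business process flows to the components they depend on. The sketch below is purely illustrative: the process names, component names, and the `disrupted_processes` function are assumptions, and a real platform control layer would derive such a mapping from much richer data.

```python
# Hypothetical mapping from business process flows to the components
# each flow depends on.
PROCESS_DEPENDENCIES = {
    "order-to-cash": ["checkout", "payment", "invoicing"],
    "returns": ["checkout", "warehouse"],
}


def disrupted_processes(component_status, dependencies=PROCESS_DEPENDENCIES):
    """Return {process: [failing components]} for every affected process.

    A component with no reported status is treated as failing, since an
    unobserved component cannot be assumed healthy.
    """
    impact = {}
    for process, components in dependencies.items():
        failing = [c for c in components if component_status.get(c) != "UP"]
        if failing:
            impact[process] = failing
    return impact
```

Given component-level statuses in which only `payment` is down, the function reports that `order-to-cash` is disrupted and names `payment` as the cause, which is the business-level statement no single component could make on its own.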
In a platform-based portfolio architecture, platform control implements the ‘Sense & Response’ pattern for maximizing autonomous operations yet providing 360° visibility into an integrated platform. Here, we leverage the observability of everything 'from metal to experience'.
Speaking of control, and referring back to the origins of observability, the observability and controllability of a linear system are mathematical duals. In a software architecture context, this duality applies just as well; we know controllability as manageability. So, with observability in place, we will have to turn our attention to manageability (manageable: capable of being managed or controlled). I will leave that discussion to a subsequent blog.