Mphasis | Observation is a spectator sport

March 01, 2022

Observation is a spectator sport

Bert Hooyman

Chief Architect – Mphasis

My earlier articles on Observability, Controllability, and Interoperability discussed the design of a distributed system. A subsequent article on Supervision described the role of a supervisor as exercising control over such a distributed system; the supervisor leverages the controllability of the system's components.

This article elaborates on the role of the observer, i.e. the component that collects the metrics issued by the system's components.

The observer leverages the Observability of the system's components, aggregates the observations, and provides an observable interface to the outside world.

As can be seen in Figure 1, the observer is part of the control plane of the system. Like the supervisor, it is not exposed to any business events that are passed along the services in the data plane.

Role of the Observer

As an observation aggregator, the observer hides the complexity of the distributed system and exposes a unified observable interface. Viewed from the outside, the observer represents the system, in the same way the supervisor represents the system. The difference between the two is that the supervisor plays an active role in keeping the system running, whereas the observer is much more passive. Together, the supervisor and the observer are a complete proxy of the distributed system.

When we think of system management as a monitor & manage loop, the observer deals with the monitoring, and the supervisor deals with the management aspect. This is how modern control tower implementations operate.

Using an observer/supervisor pair to represent the distributed system, the control tower does not have to be aware of the inner details of the distributed system. The control tower interacts with the observer and the supervisor as proxies.

Observer Responsibilities

Aggregating observations from the system's components is a key responsibility of the observer, but not its only one. The single responsibility principle is violated here, much as it is violated by the supervisor.

Following are the essential observer responsibilities:

- Gathering local observations
- Drawing insights from local observations
- Publishing insights
- Monitoring the topology of the system

Gathering local observations

This is the easy part – subscribing to the metrics channel of the system is all it takes to gather the observations of all components of the system.
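A minimal sketch of this subscription, assuming an in-memory stand-in for the metrics channel. The names `MetricsChannel`, `Observation`, and `Observer`, and the metric names used, are hypothetical and chosen only for illustration; a real system would use whatever messaging infrastructure carries its metrics.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, List


@dataclass(frozen=True)
class Observation:
    component: str   # name of the component that issued the metric
    metric: str      # hypothetical metric name, e.g. "events_out_per_s"
    value: float


class MetricsChannel:
    """Hypothetical in-memory stand-in for the system's metrics channel."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[Observation], None]] = []

    def subscribe(self, callback: Callable[[Observation], None]) -> None:
        self._subscribers.append(callback)

    def publish(self, observation: Observation) -> None:
        # Deliver the observation to every subscriber.
        for callback in self._subscribers:
            callback(observation)


class Observer:
    """Collects every observation posted on the channel, grouped by component."""

    def __init__(self, channel: MetricsChannel) -> None:
        self.observations: DefaultDict[str, List[Observation]] = defaultdict(list)
        channel.subscribe(self._on_observation)

    def _on_observation(self, observation: Observation) -> None:
        self.observations[observation.component].append(observation)


# Usage: two components post metrics; the observer sees both.
channel = MetricsChannel()
observer = Observer(channel)
channel.publish(Observation("A", "events_out_per_s", 120.0))
channel.publish(Observation("B", "events_in_per_s", 118.0))
```

The observer only subscribes here; it never publishes on the metrics channel itself.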

Drawing insights from local observations

Gathering observations is easy, but drawing conclusions from those observations is much harder, especially because the observer initially has no clue what the system is intended to do or how it is composed.

At least two things are needed for the observer to do its job here:

- The observations must be normalized to some extent. In other words, when observations are exchanged, the system's components must speak a language that the observer can understand.
- The observer must be aware of the structure (the topology) of the distributed system. For example, when two components A and B are connected by a channel C, and component A is no longer able to produce output (for any reason), channel C will drain shortly and component B will then fall idle. This is a totally predictable cause & effect scenario. The observer needs to be aware of the flow of events in order to avoid raising superfluous warnings about the behavior of B.

With these requirements met, the observer is empowered to do a root cause analysis of many operational issues, and it is likely to be the only component that can sensibly do such an analysis.

As a side effect, the observer acts as a filter, ignoring metrics that reflect downstream behavior caused by upstream issues.
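The A, B, C scenario can be sketched as a small graph computation. This is one possible approach, not a prescribed one: it assumes the topology is available as a map from each component to its direct downstream components, that the event flow is acyclic, and that the observer has already classified some components as unhealthy. The names `Topology`, `upstream_of`, and `root_causes` are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical topology: each component maps to the components directly
# downstream of it. Channel C is implied by the edge A -> B.
Topology = Dict[str, List[str]]


def upstream_of(topology: Topology) -> Dict[str, Set[str]]:
    """Invert the downstream map: for each component, its direct upstream components."""
    upstream: Dict[str, Set[str]] = {name: set() for name in topology}
    for source, targets in topology.items():
        for target in targets:
            upstream.setdefault(target, set()).add(source)
    return upstream


def root_causes(topology: Topology, unhealthy: Set[str]) -> Set[str]:
    """Keep only the unhealthy components that have no unhealthy ancestor.

    Assumes an acyclic event flow; components in a cycle would mask each other.
    """
    upstream = upstream_of(topology)

    def has_unhealthy_ancestor(component: str) -> bool:
        seen: Set[str] = set()
        stack = list(upstream.get(component, ()))
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current in unhealthy:
                return True
            stack.extend(upstream.get(current, ()))
        return False

    return {c for c in unhealthy if not has_unhealthy_ancestor(c)}


# Usage: A stops producing, so B falls idle. Only A is reported.
topology = {"A": ["B"], "B": []}
reported = root_causes(topology, {"A", "B"})
```

When B is idle while A is healthy, B has no unhealthy ancestor and is reported on its own, so genuine downstream problems are not suppressed.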

Publishing insights

Publishing insights is a special case of being observable. Remember that, in a distributed system, observability is realized by posting observations on a metrics channel. The observer is a subscriber to that channel, not a publisher, so it won't publish on the metrics channel.

To publish insights, the observer must make use of an external service interface. This could be a request/reply interface, but a streaming interface is better suited. Ideally, the control tower exposes a pub/sub channel for this.
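As a rough illustration of the pub/sub option, the sketch below posts an insight on a channel owned by the control tower. Everything here is an assumption made for the example: the `ControlTowerChannel` class, the `Insight` fields, the `insights/<system>` topic naming, and the JSON encoding.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Insight:
    system: str      # the distributed system this observer represents
    severity: str    # hypothetical levels, e.g. "info", "warning", "critical"
    summary: str


class ControlTowerChannel:
    """Hypothetical pub/sub channel exposed by the control tower."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[str], None]]] = {}

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic: str, message: str) -> None:
        for callback in self._subscribers.get(topic, []):
            callback(message)


def publish_insight(channel: ControlTowerChannel, insight: Insight) -> None:
    """Serialize the insight and post it on a per-system topic (naming is an assumption)."""
    payload = json.dumps(asdict(insight), sort_keys=True)
    channel.publish(f"insights/{insight.system}", payload)


# Usage: the control tower listens; the observer pushes insights as they arise.
received: List[str] = []
tower = ControlTowerChannel()
tower.subscribe("insights/orders", received.append)
publish_insight(tower, Insight("orders", "warning", "component A stopped producing output"))
```

With a streaming interface like this, the control tower does not need to poll the observer.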

Control Tower as a System of Systems

A control tower must be able to understand the insights (observations) that the observer publishes, hence some normalization is required at this level as well. Moreover, the control tower will need to be aware of the role of each system in the portfolio of systems it is managing. There appears to be a recursive need for a topology, i.e. a description of how the various (distributed) systems interoperate with each other.

Forwarding Local Observations

From time to time, the observer may be asked to expose (i.e. to forward) particular metrics from designated components. This is how controllable observability is applied.

In this scenario, the external consumer of the observations (i.e. the control tower) expects metrics in the format that is agreed with the systems it manages. Hence, the observer must transform the metrics from the internal format to the agreed external format, unless both worlds have standardized on the same format.
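Such a transformation might look like the sketch below. Both formats are invented for the example: the internal record keys (`component`, `name`, `value`, `ts`), the external field names, and the `EXTERNAL_NAMES` mapping are all assumptions, since the article does not define either format.

```python
from typing import Any, Dict

# Hypothetical internal metric record, as issued on the metrics channel.
InternalMetric = Dict[str, Any]

# Hypothetical mapping from internal metric names to the externally agreed names.
EXTERNAL_NAMES = {
    "q_depth": "queue.depth",
    "evt_out": "events.out.rate",
}


def to_external(system: str, metric: InternalMetric) -> Dict[str, Any]:
    """Translate one internal metric into the (assumed) external format."""
    name = metric["name"]
    return {
        "system": system,
        "source": metric["component"],
        # Metrics without a mapping pass through under their internal name.
        "name": EXTERNAL_NAMES.get(name, name),
        "value": metric["value"],
        # Assumed convention: internal seconds, external milliseconds.
        "timestamp_ms": int(metric["ts"] * 1000),
    }


# Usage: forward a queue depth metric from component B.
forwarded = to_external(
    "orders", {"component": "B", "name": "q_depth", "value": 7, "ts": 12.5}
)
```

If both worlds have standardized on the same format, this step reduces to passing the metric through unchanged.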

Monitoring the topology of the system

The primary role of the observer is to represent the distributed system as a single manageable unit. In doing so, the observer hides the complexity of the distributed system. But that complexity (i.e. its runtime topology) must be managed as well. The system's supervisor realizes the topology but receives little feedback after the initial realization (i.e. after the wiring phase). For this feedback, it relies on the observer.

Topology awareness

The observer needs to be aware of the actual system topology, if only to draw insights from local observations (see "Drawing insights from local observations" above). The supervisor provides a query interface for topology discovery, and the observer uses this to refresh its internal state of the topology.

Every time the topology changes, for example when additional component instances are deployed to improve throughput performance, the supervisor informs the observer via the control channel. The control event can be as simple as "The topology changed" or as complex as "Here's what changed just now: <…>".
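Both forms of the control event can be handled by one small piece of observer state, sketched below. The event shape is an assumption: a hypothetical `TopologyChanged` event whose `added` and `removed` fields are empty in the simple form, and where each entry in `added` replaces that component's set of downstream components. The `query_supervisor` callable stands in for the supervisor's topology query interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

Topology = Dict[str, Set[str]]  # component -> downstream components


@dataclass
class TopologyChanged:
    """Hypothetical control event; both fields are empty for the simple form."""
    added: Dict[str, Set[str]] = field(default_factory=dict)
    removed: List[str] = field(default_factory=list)


class TopologyView:
    """The observer's internal copy of the system topology."""

    def __init__(self, query_supervisor: Callable[[], Topology]) -> None:
        self._query_supervisor = query_supervisor
        self.topology: Topology = query_supervisor()

    def on_control_event(self, event: TopologyChanged) -> None:
        if not event.added and not event.removed:
            # "The topology changed": no details, so re-query the supervisor.
            self.topology = self._query_supervisor()
            return
        # "Here's what changed": apply the delta locally.
        for component in event.removed:
            self.topology.pop(component, None)
            for downstream in self.topology.values():
                downstream.discard(component)
        for component, downstream in event.added.items():
            self.topology[component] = set(downstream)


# Usage: the supervisor scales out B by adding a second instance, B2.
current: Topology = {"A": {"B"}, "B": set()}
view = TopologyView(lambda: {name: set(targets) for name, targets in current.items()})
view.on_control_event(TopologyChanged(added={"A": {"B", "B2"}, "B2": set()}))
```

The simple form costs a full query on every change; the detailed form avoids that at the price of a richer event contract between supervisor and observer.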

Detecting & reporting component liveness

Components that experience serious problems normally use the exception channel to inform the supervisor of any issues. That works fine until the component crashes. After that, there is only silence.

The observer, being aware of the system topology, will ultimately detect loss of heartbeat. This can be raised as an exception and transmitted to the supervisor using the exception channel. The supervisor determines whether this is a critical scenario; based on that decision, the observer either posts an observation about the critical health issue or treats the loss as a configuration change.
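Loss of heartbeat can be detected with a last-seen table keyed by the components in the topology. The sketch below is one way to do it, under assumptions: the `HeartbeatMonitor` name, the single fixed timeout, and the explicit `now` arguments are all hypothetical choices. Passing time in explicitly keeps the example deterministic; a real observer would more likely read a monotonic clock.

```python
from typing import Dict, Iterable, List


class HeartbeatMonitor:
    """Tracks when each known component was last heard from."""

    def __init__(self, components: Iterable[str], timeout_s: float, now: float) -> None:
        self.timeout_s = timeout_s
        # Start the clock for every component in the topology, so that a
        # component that never reports at all is detected as well.
        self.last_seen: Dict[str, float] = {name: now for name in components}

    def record(self, component: str, now: float) -> None:
        """Call for every heartbeat or observation received from `component`."""
        self.last_seen[component] = now

    def silent_components(self, now: float) -> List[str]:
        """Components whose last sign of life is older than the timeout."""
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


# Usage: A reports at t=8s, B has been silent since startup.
monitor = HeartbeatMonitor(["A", "B"], timeout_s=10.0, now=0.0)
monitor.record("A", now=8.0)
silent = monitor.silent_components(now=12.0)
```

Each name returned by `silent_components` is a candidate for an exception on the exception channel; whether it is critical remains the supervisor's call.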

Conclusions

In this article, the role of an observer in a distributed system is elaborated. The value contribution of the observer is the creation of relevant observations about the overall system behavior, whilst suppressing any noise that is caused by internal system issues. The interaction between a distributed system and a control tower is considered as well, leading to an understanding of internal system control and external system management.

Earlier articles

Observability is King – October 2021
Controllability is Queen – November 2021
Interoperability is Chief of Staff – January 2022
Supervision is the Executive Branch – March 2022