Lenses

Oct 24, 2024

The Problem

IT systems have grown steadily more complex over time. Since the early days, engineers have built larger and increasingly bespoke systems, layering abstractions on top of each other through intricate trade-offs to achieve properties such as high availability or simply to fulfill business needs. Complexity is neither inevitable nor an end in itself; it is a byproduct of how the IT industry implements solutions.

This trend has compelled engineers, those who directly interact with machines, to develop increasingly complex mental models to reason about, operate, and further develop the systems for which they are responsible. During incidents, resolving a mismatch between an engineer’s mental model and some part of the actual system’s state or structure often consumes most of the incident’s duration. In some cases, this mismatch may even cause an incident or security vulnerability. Yet parsing information and forming a mental model of the relevant details is only the preliminary work of understanding a system.¹

Incomplete Solutions

The continuous rise in system complexity has not been met with a similar investment in secondary systems that aid engineers in comprehending the primary systems they operate. While monitoring, alerting solutions, and various visualization tools have been developed and refined over the years, they have never truly caught up with the complexity of primary systems. Consequently, the effort required to build an ad-hoc mental model of a given system keeps increasing.

When we watch modern science fiction TV shows or movies, we often see characters operating systems of complexity akin to that of current IT systems. To follow the story, the average viewer must be able to develop a minimal mental model of the system’s state at a glance. As a result, the visualization of complex sci-fi systems differs vastly from the everyday visualizations of actual systems of similar complexity. The visual language used is more obvious, more direct, and better designed, aimed at quick intuitive clarity. We make it easy for TV viewers to understand, yet we do not invest the same effort when real systems depend on human understanding.

The visualization of contemporary IT systems has lagged behind their actual complexity because engineers have grown accustomed to constructing a mental image of the system based on visualizations, rather than expecting the visualizations themselves to be readily consumable and interpretable. We often perceive developing a basic understanding of a given system on the spot as part of the job, whereas this process could be delegated to machines if we trained them to assist us more effectively.

Additionally, secondary systems are typically maintained separately from primary systems. Documentation is created before or after changes but is not automatically generated by the same mechanism that implements the change. Monitoring configurations are edited manually for each individual resource instead of utilizing libraries of reusable components. Even in cases where monitoring configurations are automatically created, this is rarely the case for other visualization systems.

To bridge the gap between systems and their visualizations, which serve as tools for understanding, we must treat visualization not as a separate system but as something derived from the same source as the actual system. We must also strive to make the representations as easily understandable as possible, minimizing the effort required to build a mental model.

A Model

IT systems essentially consist of a structure: the actual components that make up the specific system. This could be a database comprising a set of virtual machines with attached volumes, residing within a subnet. The structures are derived from a machine-readable definition, for example a manifest that a pipeline applies by running Terraform. Over time, the operation of these structures leads to a state, which includes metrics such as memory usage and disk space utilization at any given moment. We also typically have a set of assumptions about our structures, such as the expectation that a database’s free disk space will remain above a certain threshold for it to function properly.

In most organizations, the following secondary systems are implemented and operated to support any given primary system:

- Monitoring
- Metrics
- Documentation

Each of these systems serves as a form of visualization, helping human engineers construct a mental model of the primary system. They are all visualizations of the same underlying reality, simply different lenses through which to view the same system.

- Visualizing only the bare structure itself is referred to as documentation.
- Visualizing the structure’s state at a given time is known as metrics.
- Visualizing whether assumptions apply to the current state is called monitoring.

All of these forms represent some kind of visualization.
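The three lenses can be sketched as three functions over the same structures. This is only an illustration under stated assumptions: the class and function names (`Structure`, `Assumption`, `documentation`, `metrics`, `monitoring`) and the sample values are hypothetical and not taken from any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assumption:
    """An expectation about state, e.g. free disk space stays above a minimum."""
    metric: str
    minimum: float


@dataclass(frozen=True)
class Structure:
    """One component of the primary system, as described by the definition."""
    name: str
    kind: str
    assumptions: tuple = ()


def documentation(structures):
    """Lens 1: the bare structure only."""
    return [f"{s.name} ({s.kind})" for s in structures]


def metrics(structures, state):
    """Lens 2: the structure's state at a given time."""
    return {s.name: state.get(s.name, {}) for s in structures}


def monitoring(structures, state):
    """Lens 3: whether each assumption holds for the current state."""
    results = {}
    for s in structures:
        for a in s.assumptions:
            value = state.get(s.name, {}).get(a.metric)
            # A missing measurement counts as a failed assumption.
            results[f"{s.name}.{a.metric}"] = value is not None and value >= a.minimum
    return results


# Hypothetical sample system and measured state.
db = Structure("db", "virtual_machine", (Assumption("disk_free_gb", 10.0),))
web = Structure("web", "virtual_machine")
state = {"db": {"disk_free_gb": 4.2}, "web": {"load": 0.3}}

print(documentation([db, web]))
print(metrics([db, web], state))
print(monitoring([db, web], state))
```

All three functions take the same list of structures as input; only the amount of additional information (none, state, state plus assumptions) differs.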

[Figure: diagram of concepts]

Proposed Solution

I propose that engineers begin to view monitoring, metrics, and documentation as only slightly differing visualizations of the same primary system—lenses on the same object. The visualizations should be developed as part of the structural elements they represent.

1. Describe and edit the definition in a machine-readable, versionable way.
   - Add explanatory annotations to specific units of the definition. Make them available as metadata for the generated structure.
   - Incorporate assumptions about state during the structure’s lifecycle.
2. Generate all structures from this definition using a reproducible, idempotent mechanism.
3. Generate all visual representations from this definition using a reproducible, idempotent mechanism.
4. Prioritize the creation of visual, intuitively understandable representations of your structure and its behavior. Design this so that the mental model and structure (including its behavior) overlap as much as possible.

Example

Describe and edit the definition in a machine-readable, versionable way.

Suppose we want to build a minimal system consisting of two virtual machines: a web application and its database, running on a virtualization host. We create a Git repository that holds Terraform manifests describing all elements (structures) for this setup, including instances, volumes, subnets, etc.

Add explanatory annotations to specific units of the definition. Make them available as metadata for the generated structure.

We add annotations to the Terraform manifests that are rendered into metadata fields of the actual cloud instances. These annotations can include short human-readable comments about the objects they belong to, and they also get included in the rendered documentation and appear in the monitoring diagrams.
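As a rough sketch of this step, the following shows one annotation feeding both the instance metadata and the documentation. A plain Python dict stands in for the Terraform manifests here, and the field names (`annotation`, `description`) and the annotation texts are assumptions made for illustration.

```python
# Hypothetical stand-in for the annotated definition; in the setup described
# above this information would live in the Terraform manifests.
DEFINITION = {
    "web": {"kind": "virtual_machine", "annotation": "Serves the web application."},
    "db": {"kind": "virtual_machine", "annotation": "Stores application data."},
}


def instance_metadata(definition):
    """Metadata fields to attach to each generated instance."""
    return {
        name: {"description": unit["annotation"]}
        for name, unit in definition.items()
    }


def documentation_lines(definition):
    """The same annotations, rendered as lines for the documentation."""
    return [
        f"{name} ({unit['kind']}): {unit['annotation']}"
        for name, unit in sorted(definition.items())
    ]


print(instance_metadata(DEFINITION))
print("\n".join(documentation_lines(DEFINITION)))
```

Because both outputs read the same field, a comment edited in the definition cannot drift apart between the instance and its documentation.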

Incorporate assumptions about state during the structure’s lifecycle.

We add annotations to the individual components of the manifests, for example, specifying the level of disk space usage at which the system is expected to fail.

The system’s state must be measured; everything else can be derived from the definition. Thus, the definition should encompass everything we know about the system prior to its operation.
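To illustrate the split between what is defined and what is measured, here is a minimal sketch in which monitoring checks are derived entirely from the definition and only the measured values come from outside. The dict layout, the `fails_at` key, and the value of 90 percent are hypothetical.

```python
# Hypothetical definition: each unit carries its assumptions about state.
DEFINITION = {
    "db": {"assumptions": {"disk_usage_percent": {"fails_at": 90.0}}},
    "web": {"assumptions": {}},
}


def monitoring_checks(definition):
    """Derive one monitoring check per assumption in the definition."""
    return [
        {"target": name, "metric": metric, "critical_at": rule["fails_at"]}
        for name, unit in sorted(definition.items())
        for metric, rule in sorted(unit["assumptions"].items())
    ]


def violated(checks, measured):
    """Return the checks whose measured value has reached the failure level.

    `measured` maps (target, metric) to a value; checks without a
    measurement are skipped in this sketch.
    """
    return [
        c for c in checks
        if (c["target"], c["metric"]) in measured
        and measured[(c["target"], c["metric"])] >= c["critical_at"]
    ]


checks = monitoring_checks(DEFINITION)
print(checks)
print(violated(checks, {("db", "disk_usage_percent"): 93.5}))
```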

Generate all structures from this definition using a reproducible, idempotent mechanism.

We add pipelines to the project in GitLab that run Terraform to provision the virtual machines.

Generate all visual representations from this definition using a reproducible, idempotent mechanism.

This pipeline should also add representations of all structures to relevant systems. It should automatically update the operational documentation in the wiki, including the architectural diagram, and automatically add or modify the structure’s settings in the metrics collection system, adjusting thresholds in the monitoring system as needed.
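The idempotency requirement can be sketched as follows: rendering is deterministic, and a target is only written when its content differs, so a second run changes nothing. In-memory dicts stand in for the wiki, the metrics collection system, and the monitoring system; `render`, `sync`, and the definition layout are assumptions for illustration, not an existing API.

```python
import json

# Hypothetical definition; a dict stands in for the Terraform manifests.
DEFINITION = {
    "web": {"annotation": "Serves the web application.",
            "metrics": ["load"], "thresholds": {}},
    "db": {"annotation": "Stores application data.",
           "metrics": ["disk_usage_percent"],
           "thresholds": {"disk_usage_percent": 90}},
}


def render(definition):
    """Render every representation from the one definition, deterministically."""
    names = sorted(definition)
    return {
        "wiki": "\n".join(f"- {n}: {definition[n]['annotation']}" for n in names),
        "metrics": json.dumps({n: definition[n]["metrics"] for n in names},
                              sort_keys=True),
        "monitoring": json.dumps({n: definition[n]["thresholds"] for n in names},
                                 sort_keys=True),
    }


def sync(rendered, targets):
    """Write each representation only if it differs; return what changed."""
    changed = []
    for system, content in rendered.items():
        if targets.get(system) != content:
            targets[system] = content
            changed.append(system)
    return changed


targets = {}  # stand-in for the secondary systems
first_run = sync(render(DEFINITION), targets)
second_run = sync(render(DEFINITION), targets)
print(first_run)
print(second_run)
```

The first run updates all three stand-in systems; the second run reports no changes, which is the property a pipeline job would rely on when it runs on every merge.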

Prioritize the creation of visual, intuitively understandable representations of your structure and its behavior. Design this so that the mental model and structure (including its behavior) overlap as much as possible.

Work on how to visualize structures you frequently use. Decide on a standard method for visualizing a subnet and its current state. Implement that decision in code and use it to render your system’s visualizations.

We can use Terraform modules to abstract recurring patterns, defining the components that a standard virtual machine consists of (volumes, memory, CPU cores, etc.) and build applications from these building blocks. We ensure that every module includes a visual representation. A module for a virtual machine contains definitions for visualizing it in the monitoring system, ensuring that all virtual machines appear consistently.

For instance, a pale green octagon can represent a virtual machine. A high system load might gradually turn the machine red, allowing engineers to easily identify an overloaded instance within a cluster just by looking at its visualization. A network graph of communicating HTTP services could depict the percentage of 4xx replies with red edges and the number of requests reflected in the size of those edges. This design approach should cater to what feels intuitive for the engineers who need to understand the system. They shouldn’t have to manually pull data and verify hypotheses; the visualizations should enable reasoning about the system.
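One possible way to encode such a visual standard in code is sketched below. The linear color blend, the logarithmic edge width, the gray base color for edges, and the function names are all assumptions; the text above only fixes the pale green octagon, the shift toward red under load, and red, size-scaled edges.

```python
import math

PALE_GREEN = (152, 251, 152)  # the CSS color "palegreen"
RED = (255, 0, 0)
GRAY = (128, 128, 128)        # assumed base color for healthy edges


def blend(start, end, t):
    """Linearly blend two RGB colors; t is clamped to the range 0..1."""
    t = min(max(t, 0.0), 1.0)
    r, g, b = (round(s + (e - s) * t) for s, e in zip(start, end))
    return f"#{r:02x}{g:02x}{b:02x}"


def vm_style(load, overloaded_at=1.0):
    """A virtual machine is an octagon that turns red as load rises."""
    return {"shape": "octagon",
            "fillcolor": blend(PALE_GREEN, RED, load / overloaded_at)}


def edge_style(requests, client_errors):
    """Edge color reflects the share of 4xx replies, width the request count."""
    share = client_errors / requests if requests else 0.0
    return {"color": blend(GRAY, RED, share),
            "width": 1 + math.log10(1 + requests)}


print(vm_style(0.0))
print(vm_style(1.0))
print(edge_style(requests=1000, client_errors=50))
```

Keeping these functions next to the module that defines the virtual machine means a discussion about colors or shapes becomes an ordinary code change.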

Over time, we develop a library of modules to build systems and render their representations into secondary visualization systems. Engineers may start to discuss what colors should represent specific states, indicating they have offloaded their mental modeling to the visualizations. Since the representation is derived from a common source, any changes to better align the visuals with mental models can be handled in a merge request and discussed among engineers, just like any other review.

Conclusion

Modern IT systems and their associated representations, which engineers rely on to build mental models necessary for operation, have diverged over time. Systems should be constructed from standardized building blocks, which should include not only definitions for producing the structures necessary to form the system but also definitions for representing these structures across various visualization systems, such as monitoring and documentation.

When structures and their representations originate from the same source, they evolve in sync. Visualizations are critical for understanding and operating IT systems and must be treated as first-class configuration items. This shift would focus engineering efforts toward a more abstract, yet intuitive and clear, understanding of systems and their behaviors.

Instead of attempting to piece together mental models from fragmented views, we should invest in solutions that facilitate an intuitive understanding of systems.

Thanks to Jakob and Felix for providing valuable input.

¹ Experience will significantly accelerate this process, thus becoming a major source of pride for individual engineers.