A Software Observability Roundup
January 02, 2025
I spent some time recently catching up on my #to-read
saves in Obsidian. More than a few of these
were blog posts from 2024 about software observability. Talk of "redefining observability",
"observability 2.0", and "try Honeycomb" had caught my eye in a few spaces,
and so I had been hoarding links on the topic.
After spending a few days immersing myself in those articles and branching out to others, I decided to write this bullet-form roundup:
- for myself, as a way of solidifying my current understanding
- in public, as a way to invite corrections and improvements (drop a comment below or @parente.dev on Bluesky!)
- with my colleagues in mind, as a new way to approach and discuss an evergreen question:
As our issue space changes and grows, and our solutions adapt and scale in response, what (else) should we do today so that we can readily address unknown-unknowns tomorrow?
Overview
The seventeen references I surveyed offer perspectives on observability as it pertains both to software systems and to the organizations around them. They cover what observability is, what problems it solves, and how it is and should be implemented. The authors broadly align on the state of affairs, lessons learned, and the direction the industry should head. Shared terminology and goals are still works in progress.
Origins
- According to control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.1 (A formal statement appears after this list.)
- The discipline of software engineering (distributed computing, site reliability engineering, and so on) has not settled on a single definition. One that stays close to the control theory original is that software observability measures how well a system's state can be understood from the telemetry it emits.1
- Metrics, logs, and traces caught on as the three kinds of telemetry required to observe a
software system—the so-called "three pillars of observability."
- ... perhaps because they helped build a shared vocabulary at the 2017 Distributed Tracing Summit.2
- ... perhaps because they do provide a comprehensive way for engineers to monitor systems for known problems and hint at where the issue lies.3
- ... perhaps because solutions for monitoring systems using metrics, logs, and traces are what vendors had to sell.4
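As background (my note, not a claim from the surveyed posts): the control-theory definition is precise. For a linear time-invariant system, Kalman's rank condition says the internal state can be reconstructed from the measured outputs exactly when the observability matrix has full rank.

```latex
% Linear time-invariant system with state x(t) in R^n and output y(t):
%   \dot{x} = A x, \qquad y = C x
% The system is observable iff the observability matrix has rank n:
\mathcal{O} =
\begin{bmatrix}
  C \\ C A \\ C A^{2} \\ \vdots \\ C A^{n-1}
\end{bmatrix},
\qquad
\operatorname{rank}(\mathcal{O}) = n .
```

The software usage borrows the spirit (infer what is happening inside from what comes out) rather than the formalism.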
Problems and Limitations
- The task of analyzing disjoint metrics, logs, and trace data falls on humans when using
three-pillar systems designed primarily for monitoring.5
- Moving beyond investigation of known-knowns is difficult without data and tooling designed to support correlations and experimentation.6
- Use of monitoring tools leads to organizational reliance on the intuition of a few system experts, resulting in cognitive costs and bus-factor risks. Low visibility slows development and reduces team confidence.8
- Using CloudWatch logs, CloudWatch metrics, and X-Ray traces together, for example, requires users to infer answers to questions from their mental model of the system, incomplete data, disparate views, and reading of code.9
- The three-pillar data model constrains the types of questions that can be asked and answered,
with an almost exclusive focus on engineering concerns. Even mature observability programs will
struggle to answer questions of greater interest and value to the business3, such
as:
- What's the relationship between system performance and conversions, by funnel stage, broken down by geo, device, and intent signals?
- What's our cost of goods sold per request, per customer, with real-time pricing data of resources?
- How much does each marginal API request to our enterprise data endpoint cost in terms of availability for lower-tiered customers? Enough to justify automation work?
- There are many sources of truth when disparate formats (metrics, logs, traces) and/or tools are in play, with decisions made at write-time about how the data will be used in the future.10 (A small sketch after this list illustrates the trade-off.)
- The value of metrics, logs, and (unsampled) traces does not scale with the costs required to collect, transfer, and store them.5 As the bill goes up, the value stays constant at best, and more likely decreases.8
- Logs get noisier and slower to search as volume grows.
- Custom metrics require more forethought and auditing as the set grows over time.
- "At the end, the three pillars of observability do not exist. It's not something we should be relying on."9
- The coexistence of metrics, logging, and tracing is not observability. They are telemetry useful in monitoring systems.7
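To make the write-time versus read-time trade-off concrete, here is a minimal hypothetical sketch (mine, not from any of the cited posts; all names are invented). A counter aggregated at write time discards the context needed for tomorrow's question, while the raw event keeps it:

```python
# Hypothetical sketch: write-time aggregation vs. keeping the raw event.

from collections import Counter

# Write-time decision: pre-aggregate a metric keyed only by status code.
# Per-customer, per-endpoint, and per-region detail is discarded forever.
requests_by_status = Counter()

def record_metric(status: int) -> None:
    requests_by_status[f"http.requests.{status}"] += 1

# The raw event keeps every attribute, so a question like "errors by
# customer tier in eu-west-1" is still answerable months later, at read time.
def record_event(status: int, customer_id: str, tier: str,
                 region: str, endpoint: str, duration_ms: float) -> dict:
    return {
        "status": status,
        "customer_id": customer_id,
        "tier": tier,
        "region": region,
        "endpoint": endpoint,
        "duration_ms": duration_ms,
    }

record_metric(500)  # can only ever answer "how many 500s?"
event = record_event(500, "cust-42", "free", "eu-west-1", "/export", 1830.0)
```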
Better Practices
- Instrument applications to emit "wide events" (also called "canonical logs" or "structured logs") as your telemetry data. A minimal sketch follows this item's sub-bullets.
- Wide events have high dimensionality (many attributes) and high-cardinality attributes (many possible unique values), making them context-rich (everything known about the event is attached to it).11
- "High-dimensionality" roughly equates with hundreds of attributes at present. Metadata about hosts, pods, builds, requests, responses, users, customers, timing, errors, teams, services, versions, third-party vendors, etc. are all fair game.12
- Have a single source of truth that stores the wide events as they are emitted. A storage sketch follows this item's sub-bullets.
- Do no aggregation at write-time. Make decisions at read-time about how to query and use the data.10 11
- About a million wide events per day from a continuously running service can compress to roughly 80 MB in columnar formats like Parquet and cost pennies to retain for a few months in typical object stores.12
- Custom metrics become effectively unlimited: costs no longer increase linearly with their number (thanks to columnar data storage), and the ability to cross-correlate grows as more event attributes are added. Intelligent sampling can control volume costs associated with these structured events when scale demands it.8
- Storing event data in one place lends itself to AI tools, which are good at correlating and summarizing14, perhaps running continually in the background.9
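A minimal sketch of write-raw, aggregate-at-read-time (mine; pyarrow, pandas, and the tiny schema are assumptions, not anything the cited posts prescribe). The same stored events roll up into whatever "metric" today's question requires:

```python
# Sketch: store raw wide events in a columnar file, derive metrics at read time.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

events = pd.DataFrame([
    {"service": "checkout", "status": 200, "tier": "free",       "region": "us-east-1", "duration_ms": 38.0},
    {"service": "checkout", "status": 500, "tier": "enterprise", "region": "eu-west-1", "duration_ms": 912.5},
    {"service": "search",   "status": 200, "tier": "enterprise", "region": "eu-west-1", "duration_ms": 12.1},
])

# Write-time: persist the events as-is, no aggregation, columnar and compressed.
pq.write_table(pa.Table.from_pandas(events), "events.parquet")

# Read-time: today's question is error rate by customer tier...
df = pq.read_table("events.parquet").to_pandas()
error_rate = df.assign(is_error=df["status"] >= 500).groupby("tier")["is_error"].mean()

# ...tomorrow's might be p95 latency by region; same raw data, new rollup.
p95_latency = df.groupby("region")["duration_ms"].quantile(0.95)

print(error_rate)
print(p95_latency)
```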
- Adopt tooling that lets you explore quickly and cheaply—to discover emergent behaviors, answer new questions, and prepare for unknown unknowns. An ad hoc query sketch follows this item's sub-bullets.
- Proper tooling allows engineers to investigate any system, regardless of their experience with it or its complexity, in a methodical and objective manner.13
- The waterfall view of traces, root spans, nested spans, and the like is not sufficient. Users need the ability to "dig" into data however they deem necessary.14
- You will never ask the same question twice. Something is different since you last asked it.15
- There is a natural tension between a system’s scalability and its feature set. You can afford more powerful observability features at scales orders of magnitude smaller than Google.5
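And a sketch of cheap, ad hoc exploration over the same stored events (mine; DuckDB and the events.parquet file from the previous sketch are assumptions). Each new question becomes a new query rather than new instrumentation:

```python
# Sketch: ad hoc, read-time exploration over stored wide events with DuckDB.

import duckdb

con = duckdb.connect()

# Today's question: which regions and tiers see slow checkouts?
print(con.execute("""
    SELECT region, tier,
           count(*)                         AS requests,
           avg(duration_ms)                 AS avg_ms,
           quantile_cont(duration_ms, 0.95) AS p95_ms
    FROM read_parquet('events.parquet')
    WHERE service = 'checkout'
    GROUP BY region, tier
    ORDER BY p95_ms DESC
""").fetchdf())

# Tomorrow's question changes the WHERE and GROUP BY, not the instrumentation.
```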
Looking Forward
- Confusion abounds about what observability really is14, to the point that folks are actively redefining it15 3 or versioning it4 16 to improve clarity.
- "Pretty much everything in business is about asking questions and forming hypotheses, then testing them." That's observability.3
- The cognitive systems engineering definition of observability—feedback that provides insight into a process and refers to the work needed to extract meaning from available data—may be a better starting point for software engineering.15
- "Observability is the process through which one develops the ability to ask meaningful questions, get useful answers, and act effectively on what you learn." It is not a tooling problem but rather a strategic capability akin to business intelligence.15
- "Observability 2.0 has one source of truth, wide structured log events, from which you can derive all the other data types." The benefit to the full software development lifecycle, the cost model, and the adoption by a critical mass of developers make observability 2.0 inevitable.10
- "Observability 1.0 gave us lots of useful answers, observability 2.0 gives us the potential to ask meaningful questions, and observability 3.0 is going to give us the ability to act effectively on what we learn."16
- There is consensus on the direction in which software observability should head: toward the better practices described above. Discussion continues to establish shared language and goals.
References
1. Observability (software). (2024, May 24). In Wikipedia.
2. Bourgon, P. (2017, February 21). Metrics, tracing, and logging. Peter Bourgon's Blog.
3. Parker, A. (2024, March 29). Re-Redefining Observability. Austin Parker's Blog.
4. Majors, C. (2024, August 7). Is It Time To Version Observability? (Signs Point To Yes). charity.wtf.
5. Sigelman, B. (2021, February 4). Debunking the 'Three Pillars of Observability' Myth. Software Engineering Daily.
6. Weakly, H. (2024, October 3). The 4 Evolutions of Your Observability Journey. The New Stack.
7. Sigelman, B. (2021, February 4). Observability Won’t Replace Monitoring (Because It Shouldn’t). The New Stack.
8. Majors, C. (2024, January 24). The Cost Crisis in Observability Tooling. Honeycomb Blog.
9. Tane, B., & Galbraith, K. (2024, December 6). Observing Serverless Applications (SVS212) [Conference presentation]. AWS re:Invent 2024, Las Vegas, Nevada, United States.
10. Majors, C. (2024, November 19). There Is Only One Key Difference Between Observability 1.0 and 2.0. Honeycomb Blog.
11. Tane, B. (2024, September 8). Observability Wide Events 101. Boris Tane's Blog.
12. Morrell, J. (2024, October 22). A Practitioner's Guide to Wide Events. Jeremy Morrell's Blog.
13. Majors, C., Fong-Jones, L., & Miranda, G. (2022, May 6). Observability Engineering: Achieving production excellence. O’Reilly Media, Inc.
14. Burmistrov, I. (2024, February 15). All you need is Wide Events, not "Metrics, Logs and Traces". A Song Of Bugs And Patches.
15. Weakly, H. (2024, March 15). Redefining Observability. Hazel Weakly's Blog.
16. Weakly, H. (2024, December 9). The Future of Observability: Observability 3.0. Hazel Weakly's Blog.
17. Majors, C. (2024, December 20). On Versioning Observabilities (1.0, 2.0, 3.0…10.0?!?). charity.wtf.