(Originally published on O'Reilly Radar).
Today our livelihoods and our very lives depend on software. With all the benefits of a software-rich world comes unprecedented complexity. Our ability to reason about the systems that we’re working with (and are part of) diminishes as their scale and interdependence increase. We can no longer rely solely on past experience, and instead have to continuously discover how systems are functioning or failing, and adapt accordingly.
This continuous adaptation requires an ability to learn deeply. Learning is not optional; it is the lifeblood of complex systems, including modern companies. And yet, we often default to the comfortable but shallow learning that is undermined by blame and biases.
Take, for example, a recent security incident at Symantec, a Fortune 500 technology company. The investigation revealed the “root cause [to be] a violation by specific individuals of established policies.” In other words, the root cause is a “few outstanding employees,” the bad apples who, despite “stringent on-boarding and security trainings,” failed to follow processes and screwed things up.
What about the recent Volkswagen diesel emissions scandal?
"This was a couple of software engineers who put [the cheating software] in for whatever reason," Michael Horn, VW's U.S. chief executive, told a House subcommittee hearing. "To my understanding, this was not a corporate decision. This was something individuals did."
And how do we deal with these folks? In the above cases, we fire them. In other cases, we suspend them, transfer them, demote them, dock their pay, prevent them from doing their previous jobs…
How comfortable do these two stories feel so far? At first glance, they’re quite coherent: we found why this incident happened, and who is responsible. It feels as though we’ve dealt with the perpetrators in a way that’s proportional to their transgressions. Justice has been served, and trust has been restored. And let’s not forget the clear message we’re sending to our organizations (and the world), because after all, as the VW executive says, “the findings of this investigation do not reflect the values or who we are as a company.” Both of these are open-and-shut cases, right?
If you’ve been working with complex systems, you might feel a slight discomfort right about now, a suspicion that, comfortable as they might first feel, both of these stories are too simplistic, and far from complete. You might ask: Have we figured out how these folks were able to violate the policy or get the cheating “lines of software code” to 11 million cars? How often does that happen? Are such actions still possible—are there safeguards or are the risks of error or misconduct inherent in the system? Have we captured what the employees were thinking, how working around policy or safeguards made sense given the information they had at the time? What other tradeoffs do employees make during the normal course of doing their jobs? In Symantec’s case, do we understand how they were able to achieve such a low error rate (only 0.023% since 1995) despite having the ability to disregard the same established policies all along? In VW’s case, how was the aberrant software not detected for so long?
The stories that emerge from these questions (“the infinite hows”) are far richer and more nuanced. In this version, there is no simple single “root cause”, but many conditions, each necessary but only jointly sufficient. These stories go beyond the obvious (the known knowns) to uncover the less obvious (the known unknowns). They also highlight the fact that there are likely unknown unknowns present: unpredictable risks that might stay hidden until the next incident. In fact, we might even feel anxious because we’ve just escorted out of the building the individuals most familiar with these particular errors and systems, who could help discover the unknowns. And we’ve sent another clear message to the organization: whatever you do, don’t get caught.
Daniel Kahneman writes in “Thinking, Fast and Slow” that the mere “achievement of coherence and of the cognitive ease ... causes us to accept a statement as true.” Quickly jumping to conclusions is an amazing human ability, and it serves us well most of the time. But when we blame or fall under the influence of cognitive biases, all we have is a story that feels good, not one that’s realistic or helpful. Blame and biases are errors in judgment that severely limit our learning, and contribute to the fragility of our complex systems.
To paraphrase David Kirkpatrick’s insight on a world being consumed by software, regardless of industry your company is now a learning company, and pretending that it’s not spells serious peril. Given this, we cannot continue to construct comfortable stories (essentially fairy tales, complete with villains we blame and punish, and simple conclusions we quickly jump to) because these simplistic stories short-circuit our ability to learn. Instead, we can choose to build richer, more realistic narratives for the sake of learning.
How do we go beyond blame or bias to build true learning organizations? That is the central question of my book, "Beyond Blame: Learning from Failure and Success." It’s a story of an incident that threatens the very existence of a large financial institution, and the counterintuitive steps its leadership takes to stop the downward spiral. Their approach relies on complexity science, resilience engineering, human factors, cognitive science, and organizational psychology. This approach allows us to identify more underlying conditions for failure, and make our systems (and organizations) safer and more resilient. It also enables us to turn our companies into learning companies.