Tag Archives: Resilience

Early detection, fast recovery, early exploitation

Danger was the safest thing in the world if you went about it right

I am now contributing safety thoughts and ideas on safetydifferently.com. Here is a reprint of my initial posting. If you wish to add a comment, I suggest you first read the other comments at safetydifferently.com and then include yours at the end to join the conversation.

Danger was the safest thing in the world if you went about it right

This seemingly paradoxical statement was penned by Annie Dillard. She is neither a safety professional nor a line manager steeped in safety experiences. Annie is a writer who, in her book The Writing Life, became fascinated by a stunt pilot, Dave Rahm.

“The air show announcer hushed. He had been squawking all day, and now he quit. The crowd stilled. Even the children watched dumbstruck as the slow, black biplane buzzed its way around the air. Rahm made beauty with his whole body; it was pure pattern, and you could watch it happen. The plane moved every way a line can move, and it controlled three dimensions, so the line carved massive and subtle slits in the air like sculptures. The plane looped the loop, seeming to arch its back like a gymnast; it stalled, dropped, and spun out of it climbing; it spiraled and knifed west on one side’s wings and back east on another; it turned cartwheels, which must be physically impossible; it played with its own line like a cat with yarn.”

When Rahm wasn’t entertaining the audience on the ground, he was entertaining students as a geology professor at Western Washington State College. His fame for “doing it right” in aerobatics led King Hussein to recruit him to teach the art and science to the Royal Jordanian stunt flying team. While performing a maneuver in Jordan, Rahm’s plane plummeted to the ground and burst into flames. The royal family and Rahm’s wife and son were watching. Dave Rahm was killed instantly.

After years and years of doing it right, something went wrong for Dave Rahm. How could this have happened? How can danger be the safest thing? Let’s turn our attention towards Resilience Engineering and the concept of Emergent Systems. By viewing Safety as an emergent property of a complex adaptive system, Dillard’s statement begins to make sense.

Clearly a stunt pilot pushes the envelope by taking calculated risks. He gets the job done, which is to thrill the audience below. Rahm’s maneuver called “headache” was startling, as the plane stalled and spun towards earth seemingly out of control. He then adjusted his performance to varying conditions to bring the plane safely under control. He wasn’t preoccupied with what to avoid and what not to do. He knew in his mind what was the right thing to do.

We can apply Richard Cook’s modified Rasmussen diagram to characterize this deliberate moving of the operating point towards failure, then taking action to pull back from the edge of failure. As the operating point moves closer to failure, conditions change, enabling danger as a system property to emerge. To Annie Dillard, this aggressive pressing-in-then-pulling-back action was how danger was the safest thing in the world if you went about it right.

“Rahm did everything his plane could do: tailspins, four-point rolls, flat spins, figure 8’s, snap rolls, and hammerheads. He did pirouettes on the plane’s tail. The other pilots could do these stunts, too, skillfully, one at a time. But Rahm used the plane inexhaustibly, like a brush marking thin air.”

The job was to thrill people with acts that appeared dangerous. And show after show Dave Rahm pleased the crowd and got the job done. However, on his fatal ride, Rahm and his plane somehow reached a non-linear complexity phenomenon called the tipping point, a point of no return, and sadly paid the final price.

Have you encountered workers who behave like stunt pilots? A stunt pilot will take risks and fly as close to the edge as possible. If you were responsible for their safety, or a consultant asked to make recommendations, what would you do? Would you issue a “cease and desist” safety bulletin? Add a new “safety first…” rule to remove any glimmers of workplace creativity? Order more compliance checking and inspections? Offer whistle-blowing protection? Punish stunt pilots?

On the other hand, you could appreciate a worker’s willingness to take risks, to adjust performance when faced with unexpected variations in everyday work. You could treat a completed task as a learning experience and encourage the worker to share her story. By showing Richard Cook’s video, you could make stunt pilots very aware of the complacency zone and how, over time, one can drift into failure. This could lead to an engaging conversation about at-risk vs. reckless behaviour.

How would you deal with workers who act as stunt pilots? Command & control? Educate & empower? Would you do either/or? Or do both/and?

When thinking of Safety, think of coffee aroma

Safety has always been a hard sell to management and to front-line workers because, as Karl Weick put forward, Safety is a dynamic non-event. Non-events are taken for granted. When people see nothing, they presume that nothing is happening and that nothing will continue to happen if they continue to act as before.

I’m now looking at Safety from a complexity science perspective as something that emerges when system agents interact. An example is aroma emerging when hot water interacts with dry coffee grounds. Emergence is a real-world phenomenon that Systems Thinking does not address.

Safety-I and Safety-II do not create safety but provide the conditions for Safety to dynamically emerge. But as a non-event, it’s invisible and people see nothing. Just as safety can emerge, so can danger as an invisible non-event. What we see is failure (e.g., accident, injury, fatality) when the tipping point is reached. We can also reach a tipping point when we do too much of a good thing. Safety rules are valuable, but if a worker is overwhelmed by too many, danger in the form of confusion and distraction can emerge.

I see great promise in advancing the Safety-II paradigm to understand what are the right things people should be doing under varying conditions to enable safety to emerge.

For further insights into Safety-II, I suggest reading Steven Shorrock’s posting What Safety-II isn’t on Safetydifferently.com. Below are my additional comments under each point made by Steven with a tie to complexity science. Thanks, Steven.

Safety-II isn’t about looking only at success or the positive
Looking at the whole distribution and all possible outcomes means recognizing there is a linear Gaussian and a non-linear Pareto world. The latter is where Black Swans and natural disasters unexpectedly emerge.
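
To make the two worlds concrete, here is a small sketch of my own (not from Steven’s posting; the sample size, seed, and tail parameter are arbitrary choices) contrasting a thin-tailed Gaussian sample with a heavy-tailed Pareto one:

    # A toy comparison of a thin-tailed Gaussian world versus a heavy-tailed
    # Pareto world where extreme outliers dominate. All parameters are
    # illustrative choices, not anything from the original posting.
    import random

    random.seed(42)
    N = 100_000
    gaussian = [random.gauss(0, 1) for _ in range(N)]
    pareto = [random.paretovariate(1.5) for _ in range(N)]  # alpha=1.5: heavy tail

    print("Gaussian max:", round(max(gaussian), 1))  # rarely beyond ~4-5 sigma
    print("Pareto max:  ", round(max(pareto), 1))    # can be orders of magnitude out
    top_1pct = sum(sorted(pareto)[-N // 100:])
    print("Share held by top 1% of Pareto draws:", round(top_1pct / sum(pareto), 2))

In the Gaussian world the worst case stays close to the average; in the Pareto world a handful of extreme draws carry most of the total, which is why planning on averages misses the Black Swans.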

Safety-II isn’t a fad
Not all Safety-I foundations are based on science. As Fred Manuele has proven, Heinrich’s Law is a myth. John Burnham’s book Accident Prone offers a history of the rise and fall of the accident-proneness concept. We could call them fads, but it’s difficult to do so since they have been blindly accepted for so long.

This year marks the 30th anniversary of the Santa Fe Institute, where complexity science was born. At the May 2012 Resilience Lab I attended, Erik Hollnagel and Richard Cook introduced the RMLA elements of Resilience Engineering: Respond, Monitor, Learn, Anticipate. They fit with Cognitive-Edge’s complexity view of Resilience: Fast recovery (R), Rapid exploitation (M, L), Early detection (A). This alignment has led to one way to operationalize Safety-II.

Safety-II isn’t ‘just theory’
As a pragmatist, I tend not to use the word “theory” in my conversations. Praxis is more important to me than spewing theoretical ideas. When dealing with complexity, the traditional Scientific Method doesn’t work. It’s neither deductive nor inductive reasoning but abductive: the logic of hunches, based on past experiences and making sense of the real world.

Safety-II isn’t the end of Safety-I
The focus of Safety-I is on robust rules, processes, systems, equipment, materials, etc. to prevent a failure from occurring. Nothing wrong with that. Safety-II asks what we can do to recover when failure does occur, plus what we can do to anticipate when failure might happen.

Resilience can be more than just bouncing back. Why return to the same place only to be hit again? Early exploitation means finding a better place to bounce to. We call it “swarming” or Serendipity if an opportunity unexpectedly arises.

Safety-II isn’t about ‘best practice’
“Best” practice does exist, but only in the Obvious domain of the Cynefin Framework. It’s the domain of intuition and of the fast thinking in Daniel Kahneman’s book Thinking, Fast and Slow. What’s the caveat with best practices? There’s no feedback loop, so people just carry on as they did before. Some best practices become good habits. On the other hand, danger can emerge from the bad ones, and one will drift into failure.

Safety-II and Resilience are about catching yourself before drifting into failure: being alert to detect weak signals (e.g., surprising behaviours, strange noises, unsettling rumours) and having physical systems and people networks in place to trigger anticipatory awareness.

Safety-II isn’t what ‘we already do’
“Oh, yes, we already do that!” is typically expressed by an expert. It might be a company’s line manager or a safety professional. There’s minimal value in challenging the response. You could execute an “expert entrainment breaking” strategy. The preferred alternative? Follow what John Kay describes in his book Obliquity: Why Our Goals Are Best Achieved Indirectly.

Don’t even start by saying “Safety-II”. Begin by gathering stories and making sense of how things get done and why things are done a particular way. Note the stories about doing things the right way. Chances are pretty high most stories will be around Safety-I. There’s your data, your evidence that either validates or disproves “we already do”. Tough for an expert to refute.

Safety-II isn’t ‘them and us’
It’s not them/us, nor either/or, but both/and: Safety-I + Safety-II. It’s Robustness + Resilience together. We want to analyze all of the data available, when things go wrong and when things go right.

The evolution of safety can be characterized by a series of overlapping life cycle paradigms. The first paradigm was Scientific Management followed by the rise of Systems Thinking in the 1980s. Today Cognition & Complexity are at the forefront. By honouring the Past, we learn in the Present. We keep the best things from the previous paradigms and let go of the proven myths and fallacies.

Safety-II isn’t just about safety
Drinking a cup of coffee should be a total experience, not just tasting the liquid. It includes smelling the aroma, seeing the Barista’s carefully crafted cream design, hearing the first slurp (okay, I confess). Safety should also be a total experience.

Safety can emerge from efficient as well as effective conditions. Experienced workers know that a well-oiled, smoothly running machine is low risk and safe. However, they constantly monitor by watching gauges, listening for strange noises, and so on. These are efficient conditions – known minimums, maximums, and optimums that enable safety to emerge. We do things right.

When conditions involve unknowns, unknowables, and unimaginables, the shift is to effectiveness. We do the right things. But what are these right things?

It’s about being in the emerging Present and not worrying about some distant idealistic Future. It’s about engaging the entire workforce (i.e., the wisdom of crowds) so no hard selling or buying-in is necessary. It’s about introducing catalysts to reveal new work patterns. It’s about conducting small “safe-to-fail” experiments to shift the safety culture. It’s about the quick implementation of safety solutions that people want now.

Signing off and heading to Starbucks.

Safety-I + Safety-II

At a conference hosted on July 03, Dave Snowden and Erik Hollnagel shared their thoughts about safety. Dave’s retrospective of their meeting is captured in his blog posting. Over the next few blogs I’ll be adding my reflections as a co-developer of Cognitive-Edge’s Creating and Leading a Resilient Safety Culture course.

Erik introduced Safety-II to the audience, a concept based on an understanding of what work actually is, rather than what it is imagined to be. It involves placing more focus on the everyday events when things go right rather than on the errors, incidents, and accidents when things go wrong. Today’s dominant safety paradigm is based on the “Theory of Error”. While Safety-I thinking has advanced safety tremendously, its effectiveness is waning and is now on the downside of the S-curve. Erik’s message is that we need to escape and move to a different view based on the “Theory of Action”.

Erik isn’t alone. Sidney Dekker’s latest presentation on the history of safety reinforces how little safety thinking has changed and how we are plateauing. Current programs such as Hearts & Minds continue to assume people have physical, mental, and moral shortcomings, as was done way back in the early 1900s.

Dave spoke about Resilience and why it is critical, as it’s in the outliers where you find threat and opportunity. In our CE safety course, we refer to the Safety-I events that help prevent things from going wrong as Robustness. This isn’t an Either/Or situation but a Both/And. You need both Robustness + Resilience.

As a young electrical utility engineer, the creator of work-as-imagined, I really wanted feedback but struggled to obtain it. It wasn’t until I developed a rapport with the workers that I was able to close the feedback loop and become a better designer. Looking back, I realize how fortunate I was, since the crews were in close proximity and exchanges were eye-to-eye.

During these debriefs I probably learned more from the “work-as-done” stories. I was told changes were necessary due to something I had initially missed or overlooked. But more often it was due to an unforeseen situation in the field, such as a sudden shift in weather or unexpected interference from other workers at the job site. Crews would make multiple small adjustments to accommodate varying conditions without fuss, without bother, and, okay, with the occasional swear word.

I didn’t know it then but I know now: these were adjustments one learns to anticipate in a complex adaptive system. It was also experiencing Safety-II and Resilience in action in the form of narratives (aka stories).

When a disaster happens, look for the positive

In last month’s blog I discussed Fast Recovery and Swarming as 2 strategies to exit the Chaotic domain. These are appropriate when looking for a “fast answer”. A 3rd strategy is asking a “slow question.”

While the process flow through the Cynefin Framework is similar to Swarming (Strategy B), the key difference is not looking for a quick solution but attempting to understand the behaviour of the agents (humans, machines, events, ideas). The focus is on identifying something positive emerging from the disaster, a serendipitous opportunity worth exploiting.

By conducting safe-to-fail experiments, we can probe the system, monitor agent behaviour, and discover emerging patterns that may lead to improvements in culture, systems, processes, and structure.

Occasions can arise when abductive thinking could yield a positive result. In this type of reasoning, we begin with some commonly known facts that are already accepted and then work towards an explanation. In the vernacular, it’s playing a hunch.

Snowstorm Repairs

In the electric utility business, when the “lights go out”, a trouble crew is mobilized and the emergency restoration process begins. Smart crews are also on the lookout for serendipitous opportunities. One case involved a winter windstorm causing a tree branch to fall across live wires. Upon restoration, the crew leader took it upon himself to contact the customers affected by the outage to discuss removal of other potentially hazardous branches. The customers were very willing and approved the trimming. The serendipity arose because these very same customers had vehemently resisted having their trees trimmed in the fall as part of the routine vegetation maintenance program. The perception held then was that the trees were in full bloom and aesthetically pleasing; the clearance issues were of no concern. Being out of power for a period of time in the cold winter can shift paradigms.

When a disaster happens, will it be fast recovery or swarming?

Last month’s blog was about Act in the Cynefin Framework’s Chaotic domain. Be aware that you cannot remain in the Chaotic domain as long as you want. If you are not proactively trying to get out of it, somebody or something else will take action, as Asiana Airlines learned.

How do you decide to Sense and Respond? We can show 2 proactive strategies:

Strategy A is a fast recovery back to the Ordered side. It assumes you know what went wrong and have a solution documented in a disaster plan ready to be executed.

If it’s not clearly understood what caused the problem and/or you don’t have a ready-made solution in place, then Strategy B is preferred. This is a “swarming” strategy perfected by Mother Nature’s little creatures, in particular, ants.

If the path to a food supply is unexpectedly blocked, ants don’t stop working and convene a meeting like humans do. There are no boss ants that command and control. Individual ants are empowered to immediately start probing to find a new path to the food target. Not just one ant, but many participate. Once a new path is found, the news is quickly communicated along the line and a new route is established.
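
As a rough illustration of that swarming logic (my own sketch, not an actual ant model; the routes, pheromone values, and decay rates are invented), here is how many independent probes plus a shared success signal can re-establish a route after a disruption:

    # A toy swarming sketch: many ants probe independently, successes reinforce
    # a shared trail signal, and stale trails evaporate, so the colony converges
    # on an open route with no central controller. All values are illustrative.
    import random

    random.seed(1)
    routes = {"A": {"open": True, "pheromone": 5.0},   # the established path
              "B": {"open": True, "pheromone": 0.1},
              "C": {"open": True, "pheromone": 0.1}}
    routes["A"]["open"] = False  # disruption: the well-worn path is blocked

    def probe():
        """Each ant picks a route, biased by the colony's trail signal."""
        names = list(routes)
        weights = [routes[n]["pheromone"] for n in names]
        return random.choices(names, weights=weights)[0]

    for _step in range(30):            # no boss ant; just repeated local probes
        for _ant in range(20):
            r = probe()
            if routes[r]["open"]:
                routes[r]["pheromone"] += 1.0   # success reinforces the trail
        for data in routes.values():
            data["pheromone"] = max(data["pheromone"] * 0.8, 0.05)  # evaporation

    print("re-established route:", max(routes, key=lambda n: routes[n]["pheromone"]))

The point is not the ants but the pattern: decentralized probing plus fast feedback recovers a working path far sooner than waiting for a central decision.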

This is Resilience – the ability to bounce back after taking a hit. 

When a disaster happens, how fast do you act?

In the Cynefin framework, we place unexpected negative events into the Chaotic domain. The solution methodology is to Act-Sense-Respond. When a disaster produces personal injuries and fatalities, Act is about initially rendering the situation as safe as possible and stabilizing conditions to prevent additional life-threatening events from occurring.

Whenever a disaster happens, we go into “damage control” mode. We think we’re in control because we determine what information will be released, when, and by whom. Distributing information to the right channels is a key action under Act. We try our best to limit the damage not only to our people and equipment but to our brand, reputation, and credibility. In other words, we attempt to protect our level of trust with customers/clients, the media, and the general public.

In the latter stages of the 20th century, breakthroughs in information technology meant we had to learn how to quickly communicate because news traveled really fast. In today’s 21st century, news can spread even faster, wider, and cheaper by anyone who can tweet, upload a Facebook or Google+ photo, blog, etc. The damage control window has literally shrunk from hours to minutes to seconds.

This month we sadly experienced a tragedy at SFO when Asiana Airlines flight 214 crashed. I recently reviewed slides produced by SimpliFlying, an aviation consultancy focused on crisis management. Their 2013 July 06 timeline of events is mind-boggling:

11:27am: Plane makes impact at SFO
11:28am: First photo from a Google employee boarding another flight hits Twitter (within 30 secs!)
11:30am: Emergency slides deployed
11:45am: First photo from a passenger posted on Path, Facebook, and Twitter
11:56am: Norwegian journalist asks for permission to use photos from the first posters; tons of other requests follow
1:20pm: Boeing issues statement via Twitter
2:04pm: SFO Fire Department speaks to the press
3:00pm: NTSB holds press conference and keeps updating Twitter with photos
3:39pm: Asiana Airlines statement released
3:40pm: White House releases statement
8:43pm: First Asiana press release (6:43am Korea time)

Although Asiana Airlines’ first Facebook update was welcomed, they did not provide regular updates and didn’t bother replying to tweets. The bottom line was that their stock price and brand took huge hits. Essentially, they were ill-prepared to Act properly.

“In the age of the connected traveller, airlines do not have 20 minutes, but rather 20 seconds to respond to a crisis situation. Asiana Airlines clearly was not ready for this situation that ensued online. But each airline and airport needs to build social media into its standard operating procedures for crises management.”

If you encounter a disaster, how fast are you able to act? Does your emergency restoration plan include social media channels? Do you need to rewrite your Business Disaster Recovery SOPs?

If you choose to revisit or rewrite, what paradigm will you be in? If it’s Systems Thinking, your view is to control information: have little regard for what others say and release information only when you are ready, like Asiana Airlines. If you’re in the Complexity & Sense-Making paradigm, you realize you cannot control, only influence. You join and participate in the connected network that’s already hard at work commenting on your disaster.

That’s Act. How you decide to Sense and Respond will be subsequently covered.