
I strongly recommend watching/reading the entire report, or the summary by Sal Mercogliano of What's Going On In Shipping [0].
Yes, the loose wire was the immediate cause, but there was far more going wrong here. For example:
- The transformer switchover was set to manual rather than automatic, so it didn't automatically fail over to the backup transformer.
- The crew did not routinely train transformer switchover procedures.
- The two generators were both using a single non-redundant fuel pump (which was never intended to supply fuel to the generators!), which did not automatically restart after power was restored.
- The main engine automatically shut down when the primary coolant pump lost power, rather than using an emergency water supply or letting it overheat.
- The backup generator did not come online in time.
It's a classic Swiss Cheese model. A lot of things had to go wrong for this accident to happen. Focusing on that one wire isn't going to solve all the other issues. Wires, just like all other parts, will occasionally fail. One wire failure should never have caused an incident of this magnitude. Sure, there should probably be slightly better procedures for checking the wiring, but next time it'll be a failed sensor, actuator, or controller board.
If we don't focus on providing and ensuring a defense-in-depth, we will sooner or later see another incident like this.
Thanks for the summary for those of us who can't watch video right now.
There are so many layers of failures that it makes you wonder how many other operations on those ships are only working because those fallbacks, automatic switchovers, emergency supplies, and backup systems save the day. We only see the results when all of them fail and the failure happens to result in some external problem that means we all notice.
It seems to just be standard "normalization of deviance", to use the language of safety engineering. You have 5 layers of fallbacks, so over time skipping any of the middle layers doesn't cause anything to visibly fail. So in time you end up with a true safety factor equal only to the last layer. Then that fails, and looking back "everything had to go wrong".
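To put toy numbers on that (purely illustrative, not from the report): say each of five independent layers misses a given fault 1% of the time, so P(all five miss) = 0.01^5 = 1e-10, whereas with only the last layer actually in play, P(miss) = 0.01. Quietly bypassing four layers costs you eight orders of magnitude of protection, and nothing visibly fails until the day the last layer does.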
As Sidney Dekker (of Understanding Human Error fame) says: Murphy's Law is wrong - everything that can go wrong will go right. The problem arises from the operators all assuming that it will keep going right.
I remember reading somewhere that part of Qantas's safety record came from the fact that at one time they had the highest number of minor issues. In some sense, you want your error detection curve to be smooth: as you get closer to catastrophe, your warnings should get more severe. On this ship, it appeared everything was A-OK till it bonked a bridge.
This is the most pertinent thing to learn from these NTSB crash investigations - it's not what went wrong in the final disaster, but all the things that went wrong earlier without anyone detecting that they were down to one layer of defense.
Your car engaging auto-brake to prevent a collision shouldn't be a "whew, glad that didn't happen" but more of an "oh shit, I need to work on paying attention more."
The solution then is observability, to use the computing term: knowing the state of every part of the system.
Oh, it gets even worse!
The NTSB also had some comments on the ship's equivalent of a black box. It turns out that:
- it was impossible to download the data while the recorder was still inside the ship,
- the manufacturer's software was awful and the various agencies had a group chat to share third-party software(!),
- the software exported thousands of separate files,
- audio tracks were mixed to the point of being nearly unusable, and
- the black box stopped recording some metrics after the power loss "because it wasn't required to" - despite the data still being available.
At least they didn't have anything negative to say about the crew: they reacted timely and adequately - they just didn't stand a chance.
It's pretty common for black boxes to be load-shed during an emergency. Kind of funny that that was allowed for so long.
"they reacted timely and adequately" and yet: they're indefinitely restricted (detained isn't the right word, but you get it) to Baltimore, while the ship is free to resume service.
One of the things Sal Mercogliano stressed is that the crew (and possibly other crews of the same line) modified systems in order to save time.
Rather than going through the process of purging high-sulphur fuel that can't be used in US waters, they had things set up so that some of the generators were fed from US-approved fuel, which compromised redundancy and automatic failover.
It seems probable that the wire failure would not have caused catastrophic overall loss of power if the generators had been in the normal configuration.
Also the zeroth failure mode: someone built a bridge that will collapse if any of the many, many large ships that sail beneath it can't steer itself with high precision.
Ships were a lot smaller when the bridge was designed and built.
Right? There's an artificial island in that very harbor, which could be rammed by similar ships all day and give nary a fuck. It's called Fort Carroll and it was built in the *1850s*.
Why the bridge piers weren't set into artificial islands, I can't fathom. Sure. Let's build a bridge across a busy port but not make it ship-proof. The bridge was built in the 1970s - had they forgotten how to make artificial islands?
Obligatory: https://how.complexsystems.fail/
The problem is that there are a thousand merchant marine vessels operating right now that are all doing great - until the next loose wire. Nobody knows about that wire, and it worked fine on the last trip. The other systems are all just as marginal as they were on the 'Dali', but that one shitty little wire is masking it.
Running a 'tight ship' is great when you have the budget to burn on an excellent crew. But shipping is so incredibly cut-throat that crew members make very little money, are effectively modern slaves, and tend to carry responsibilities way above their pay grade. They did what they could, and more than that, and for their efforts they were rewarded with what amounted to house arrest while the authorities did their thing. The NTSB of course will focus on the 'hard' causes. But you can see a lot of frustration shine through towards the owners who even in light of the preliminary findings had changed absolutely nothing on the rest of their fleet.
The recommendation to inspect the whole ship with an IR camera had me laughing out loud. We're talking about a couple of kilometers of poorly accessible ductwork and cabinets. You can do that while in port, but in port most systems are idle or near idle, so you won't ever find an issue like this until you are underway, when vibration and power consumption shoot up.
No shipping company is realistically going to do a sea trial after every minor repair; usually a technician from some supplier boards the vessel (often while it is underway), makes a fix, and goes off-board again. Vessels that are not moving are money sinks, so the goal is to keep turnaround time in port to an absolute minimum.
What should really amaze you is how few of these incidents there are. In spite of this being a regulated industry, this is first and foremost an oversight failure; if the regulators had more budget and more manpower, there might be a stronger drive to get things technically in good order (resisting the temptation to say 'shipshape').
> But you can see a lot of frustration shine through towards the owners who even in light of the preliminary findings had changed absolutely nothing on the rest of their fleet.
Between making money, perceived culpability, and risks offloaded to insurance companies, why would they?
> The problem is that there are a thousand merchant marine vessels operating right now that are all doing great
Are they tho?
I generally think you have good takes on things, but this comes across like systemic fatalistic excuse making.
> The recommendation to inspect the whole ship with an IR camera had me laughing out loud.
Where did this come from? What about the full recommendations from the NTSB? This comment makes it seem like you are calling the whole of the NTSB's findings into question.
"Don't look for a villain in this story. The villain is the system itself, and it's too powerful to change."
https://en.wikipedia.org/wiki/Francis_Scott_Key_Bridge_colla...
> Between making money, perceived culpability, and risks offloaded to insurance companies, why would they?
Because it is the right thing to do, and the NTSB thinks so too.
>> The problem is that there are a thousand merchant marine vessels operating right now that are all doing great
> Are they tho?
In the sense that they haven't caused an accident yet, yes. But they are accidents waiting to happen, and the owners simply don't care. It usually takes a couple of regulatory interventions for such a message to sink in; what the NTSB is getting at there is that they would expect the owners to respond more seriously to these findings.
>> The recommendation to inspect the whole ship with an IR camera had me laughing out loud.
> Where did this come from?
Page 58 of the report.
And no, obviously I am not calling the whole of the NTSB's findings into question; it is just that that particular one seems to miss a lot of the realities of operating these vessels.
> "Don't look for a villain in this story. The villain is the system itself, and it's too powerful to change."
I don't understand your goal with this statement; it wasn't mine, so the quotation marks are not appropriate, and besides, I don't agree with it.
Loose wires are a fact of life. The amount of theoretical redundancy is sufficient to handle a loose wire, but the level of oversight, combined with the ad-hoc work done on these vessels (usually under great time pressure), is what caused this. And I think the NTSB should have pointed the finger at those responsible for that oversight as well, which is 'MARAD' - however, MARAD does not even rate a mention in the report.
You can also look at the problem from the perspective of the bridge. Why was it possible for a ship to take it down? Motors can fail ...
It's not realistically plausible to build bridges that won't be brought down by a ship of that size.
Yes, but think of a ship whose engine fails while underway as an unguided ballistic missile with an absolutely mind-boggling mass (the Dali masses 100,000 tonnes); there isn't much you could build that would stop it. The best suggestion I've seen is to let the ship run aground, but that ignores the situation around the area where the accident happened.
This ship wasn't being towed by a tug; it was underway under its own power, and in order for the ship to have any control authority at all it needs water flowing over the rudder.
Without that forward speed you're next to helpless, and these things don't exactly turn on a dime. So even if there had been a place where it could have run aground, it would never have been able to reach it, because it was still in the water directly in front of the passageway under the bridge.
100,000 tonnes doing 7 km/h is a tremendous amount of kinetic energy.
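Back of the envelope, using the 100,000 tonnes and 7 km/h above and rounding freely: 7 km/h ≈ 1.94 m/s, so
KE = 1/2 · m · v^2 ≈ 0.5 · 1.0e8 kg · (1.94 m/s)^2 ≈ 1.9e8 J ≈ 190 MJ,
which is on the order of 45 kg of TNT worth of energy, carried by something that cannot be meaningfully slowed once it is committed.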
The systems aboard the Dali could not have failed at a worse moment: it had - as far as I'm aware of the whole saga - just made a slight course correction to better line up with the bridge, and the helm had not yet been brought back to neutral. After that it was just Newton taking over; without control I don't think there is much that would have stopped it.
This is a good plot of the vessel's trajectory from the moment it got under way until the moment it hit the bridge:
https://www.pilotonline.com/wp-content/uploads/2024/03/5HVqi...
You can clearly see the kink in the trajectory a few hundred meters before it hit the bridge.
It's a tangent, but I don't understand why the dock workers can unionize and earn livable wages, yet the crew cannot.
The fuel pump not automatically restarting on power loss may actually have been an intentional safety feature to prevent scenarios like pumping fuel into a fire in or around the generators. Still part of the Swiss cheese model, of course.
It wasn't. They were feeding generators 1 & 2 with the pump intended for flushing the lines while switching between different fuel types.
The regular fuel pumps were set up to automatically restart, which is why a set of them came online to feed generator 3 (which automatically spun up after 1 & 2 failed, and wasn't tied to the fuel-line-flushing pump) after the second blackout.
I have found that 99% of all network problems are bad wires.
I remember that the IT guys at my old company used to immediately throw out every ethernet cable and replace it with one right out of the bag; first thing.
But these ships tend to be houses of cards. They are not taken care of properly, and run on a shoestring budget. Many of them look like floating wrecks.
If I see an RJ45 plug with a broken locking thingie, or bare wires (not just bare copper - any exposed internal wire), I chop the plug off.
If I come across a CATx (solid-core) cable being used as a really long patch lead, then I lose my shit - or perhaps get a back box, faceplate, and modules out, along with a POST tool.
I don't look after floating fires.
I recently had a home network outage. The last thing I tested was the in-wall wiring because I just didn't think that would be the cause. It was. Wiring fails!
If I had a nickel for every time someone clobbered some critical connectivity with an ill-advised switch configuration, I wouldn't have to work for a living.
And the physical-layer issues I do see are related to ham-fisted people doing unrelated work in the cage.
Actual failures are pretty damn rare.
That's true for almost all electronics. I worked on robotic arms for a few years - if things broke it was always the wiring (well, to be precise - the connectors).
The ship was 10 years old, not some WW2 hulk.
Another case study to add to the maritime chapter of this timeless classic: https://www.amazon.com/Normal-Accidents-Living-High-Risk-Tec...
Like you said (and as the book illustrates well), it's never just one thing; these incidents happen when multiple systems interact, and they often reflect disinvestment in comprehensive safety schemes.
Shipping, accidents and timeless classics.
I was sure you were going to link to Clarke and Dawe, The Front Fell Off.
I've been in an environment like that.
"Nuisance" issues like that are deferred because they are not really causing a problem, so maintenance spends its time on problems with the things that make money, rather than on what some consider spit and polish for things that have no prior failures.
Tragically, it's the same with modern software development and the growth of technical debt.
Just insane how much criminal negligence went on. Even Boeing hardly comes close. What needs to change is obviously a major review of how ships are allowed to operate near bridges and other infrastructure, and far stricter safety standards like the ones aircraft face.
Hopefully the lesson from this will be received by operators: it's way cheaper to invest in personnel, training, and maintenance than to let the shit hit the fan.
Why? It cost them $100M (https://www.justice.gov/archives/opa/pr/us-reaches-settlemen...) but rebuilding the bridge is going to cost $5.2 billion, so if gundecking all this maintenance for 20+ years has saved more than $100M, they will do it again.
From your article - this answered a question I had:
> The settlement does not include any damages for the reconstruction of the Francis Scott Key Bridge. The State of Maryland built, owned, maintained, and operated the bridge, and attorneys on the state’s behalf filed their own claim for those damages. Pursuant to the governing regulation, funds recovered by the State of Maryland for reconstruction of the bridge will be used to reduce the project costs paid for in the first instance by federal tax dollars.
Actually, to be even more cynical….
If everyone saved $100M by doing this and it only cost one shipper $100M, then of course everyone else would do it and just hope they aren’t the one who has bad enough luck to hit the bridge.
And statistically, almost all of them will be okay!
Isn't there a big liability insurance payout on this towards the $5.2 billion, and if so, won't the insurer be more motivated to mandate compliance?
The vessel owner may be able to recover some of that from the manufacturer, as the wiring was almost certainly a manufacturing error, and maybe some of the configurations that prolonged the blackout were manufacturer choices as well.
I imagine every vessel has its own corporation that owns it, which would declare insolvency if this kind of thing happens.
It's not, though. These situations are extremely rare. When they happen, they just close the company and shed the liability.
I watched Sal's video yesterday, great summary.
So much complexity, plenty of redundancy, but not enough adherence to important rules.
All you said is true - but these investigations are often used to determine financial liability, and often that comes down to figuring out the one immediate, proximate thing that caused the accident.
A whole bunch of things might have gone wrong, but if only you hadn't done/not-done that one thing, we'd all be fine. So it's all your fault!
Respectfully, have you ever actually read an NTSB report? They're incredibly thorough and consider both causes and contributing factors through a number of lenses with an exclusive focus on preventing accidents from occurring.
Also, they're basically inadmissible in court [49 U.S.C. § 1154(b)], so they're useless for determining financial liability.
Although I was never named to a mishap board, my experience in my prior career in aviation is that the proper way to look at things like this is that while it is valuable to identify and try to fix the ultimate root cause of the mishap, it's also important to keep in mind what we called the "Swiss cheese model."
Basically, the line of causation of the mishap has to pass through a metaphorical block of Swiss cheese, and a mishap only occurs if all the holes in the cheese line up. Otherwise, something happens (planned or otherwise) that allows you to dodge the bullet this time.
Meaning a) it's important to identify places where firebreaks and redundancies can be put in place to guard against failures further upstream, and b) it's important to recognize times when you had a near-miss, and still fix those root causes as well.
Which is why the "retrospectives are useless" crowd spins me up so badly.
> it's important to recognize times when you had a near-miss, and still fix those root causes as well.
I mentioned this principle to the traffic engineer when someone almost crashed into me because of a large sign that blocked their view. The engineer looked into it and said the sight lines were within spec, but just barely, so they weren't going to do anything about it. Technically the person who almost hit me could have pulled up to where they had a good view and looked both ways as they were supposed to, but that is relying on one layer of the cheese to fix a hole in another, to use your analogy.
Likewise with decorative hedges and other gardenwork; your post brought to mind this one hotel I stay at regularly, where a hedge is high enough and close enough to the exit that you have to nearly pull into the street to see if there are oncoming cars. I've mentioned to the FD that it's gonna get someone hurt one day, yet they've done nothing about it for years now.
People love to rag on Software Engineers for not being "real" engineers, whatever that means, but American "Traffic Engineers" are by far the bigger joke of a profession. No interest in defense in depth, safety, or tradeoffs. Only "maximize vehicular traffic flow speed."
To be fair, there is no way to fix this in the general case—large vehicles and other objects may obstruct your view also. Therefore, you have to learn to be cognisant of line-of-sight blockers and to deal with them anyway. So for a not-terrible driver, the only problem that this presents is that they have to slow down. Not ideal, but not a safety issue per se.
That we allow terrible drivers to drive is another matter...
> there is no way to fix this in the general case—large vehicles and other objects may obstruct your view also
Vehicles are generally temporary. It is actually possible to ensure decent visibility at almost all junctions, as I found when I moved to my current country - it just takes a certain level of effort.
> Which is why the "retrospectives are useless" crowd spins me up so badly.
When I see complaints about retrospectives from software devs, they're usually about agile or scrum retrospective meetings, which have evolved into performative routines. They're done every sprint (or every week, if you're unlucky), and even if nothing happens the whole team might have to sit for an hour and come up with things to say to fill the air.
In software, the analysis following a mishap is usually called a post-mortem. I haven't seen many complaints that those have no value; they are usually highly appreciated. Though sometimes the "blameless post-mortem" people take the term a little too literally and try to avoid exploring useful failures if they might cause uncomfortable conversations about individuals making mistakes or even dropping the ball.
Post mortems are absolutely key in creating process improvements. If you think about an organization's most effective processes, they are likely just representations of years of fixed errors.
Regarding blamelessness, I think it was W. Edwards Deming who emphasized the importance of blaming process over people, which is always preferable, but it's critical for individuals to at least be aware of their role in the problem.
Agree. I am obligated to run those retrospectives and the SNR is very poor.
It is nice, though (as long as there isn't anyone in the room the team is afraid to be honest in front of), when people can vent about something that has been pissing them off, so that I as their manager know how they feel. But that happens only about 15-20% of the time. The rest is meaningless tripe like "Glad Project X is done", "$TECHNOLOGY sucks", and "Good job to Bob and Susan for resolving the issue with the Acme account".
>When I see complaints about retrospectives from software devs they're usually about agile or scrum retrospective meetings, which have evolved to be performative routines.
You mean to tell me that this comment section, where we spew buzzwords and reference the same tropes we do for every "disaster", isn't performative?
This is essentially the gist of https://how.complexsystems.fail, which has been circulating more with the discussions of the recent AWS/Azure/Cloudflare outages.
> Swiss cheese model
I always thought that before the "Swiss cheese model" was introduced in the 1990s, the term "Swiss cheese" was used to mean something that had oodles of security holes (flaws).
Perhaps I find the metaphor weird because pre-sliced cheese was introduced later in my life (processed slices were around in my childhood, but packets of pre-sliced cheese are much more recent).
>Which is why the "retrospectives are useless" crowd spins me up so badly.
As an Ops person, I've said that before when talking about software, and it's mainly because most companies refuse to listen to the lessons inside them - so why am I wasting time doing this?
To put it in aviation terms, I'll write up something like (numbers made up): "Hey, V1 for a Hornet loaded at 49,000 pounds needs to be 160 knots, so it needs 10,000 feet for takeoff." Well, the sales team comes back and says NAS Norfolk is only 8,700 ft and the customer demands 49,000+ pound loads; we are not losing revenue, so quiet, Ops nerd!
Then the 49,000+ pound Hornet loses an engine, overruns the runway, the fireball I'd said would happen happens, and everyone is SHOCKED, SHOCKED I TELL YOU, that this is happening.
Except it's software and not aircraft, and the loss was just some money, maybe, so no one really cares.
> All the holes in the cheese line up...
I absolutely heard that in Hoover's voice.
Is there an equivalent to YouTube's Pilot Debrief or other similar channels but for ships?
As I said elsewhere, the upshot is that you need to know which holes the bullet went through so you can fix them. Accidents like this happen when someone does not (care to) know the state of the system.
> Basically, the line of causation of the mishap has to pass through a metaphorical block of Swiss cheese, and a mishap only occurs if all the holes in the cheese line up.
The metaphor relies on you mixing and matching some different batches of presliced Swiss cheese. In a single block, the holes in the cheese are guaranteed to line up, because they are two-dimensional cross sections of three-dimensional gas bubbles. The odds of a hole in one slice of Swiss cheese lining up with another hole in the following slice are very similar to the odds of one step in a staircase being followed by another step.
The three-dimensional gas bubbles aren't connected. An attacker has to punch through the thin walls to cross between the bubbles or wear and tear has to erode the walls over time. This doesn't fundamentally change anything.
Life tip: nitpicking a figure of speech is never useful and always makes you look like an arse.
No, it's a metaphor.
And there's the archetypal comment on technology-based social media that is simultaneously technically correct and utterly irrelevant to the topic at hand.
Note that "Don't make mistakes" is no more actionable for maintenance of a huge cargo ship than for your 10MLoC software project. A successful safety strategy must assume there will be mistakes and deliver safe outcomes nevertheless.
Obviously this is the standard line in any disaster-prevention effort, and it makes sense 99% of the time. But what's the standard line about where this whole protocols-to-catch-mistakes thing bottoms out? Obviously people executing the protocol can make mistakes, or fall victim to normalization of deviance. The same is true for the next level of safety protocol you layer on top of that. At some level, the only answer really is just "don't make mistakes", right? And you're mostly trying to make sure you can do that at a level where it's easier not to make mistakes, like simpler decisions not made under time pressure.
Am I missing something? I feel like one of us is crazy when people are talking about improving process instead of assigning blame without addressing the base case.
Normalization of deviance doesn't happen through people "making mistakes", at least not in the conventional sense. It's a deliberate choice, usually a response to bad incentives, or sometimes even a reasonable tradeoff.
I mean, ultimately establishing a good process requires making good choices and not making bad ones, sure. But the kind of bad decisions you have to avoid are not really "mistakes" in the same way that, like, switching on the wrong generator is a mistake.
It kind of is, though. There's a lot less opportunity for failures at the limit and at unforeseen scale. Mechanical things also mostly don't keel over or go haywire with no warning.