Story Detail of id 47453875 | Liveview Hacker News

drob518 8 hours ago | on: Entso-E final report on Iberian 2025 blackout

Frequently, when you see these massive failures, the root cause is an alignment of small weaknesses that all come together on a specific day. See, for instance, the space shuttle O-ring incident, Three-Mile Island, Fukushima, etc. These are complex systems with lots of moving parts and lots of (sometimes independent) people managing them. In a sense, the complexity is the common root cause.

burningChrome 1 hour ago | parent | next

This is the same thing that happened with the 35W bridge collapse in Minneapolis. After the disaster, the gusset plates were examined and found to be only 1/2" thick when the original design called for them to be 1" thick. The bridge was a ticking time bomb since the day it was built in 1967.

As the years went on, the bridge's weight capacity was slowly eroded by subsequent construction projects like adding thicker concrete deck overlays, concrete median barriers and additional guard rail and other safety improvements. This was the second issue, lining up with the first issue of thinner gusset plates.

The third issue that lined up with the other two was the day of the bridge's failure. There were approximately 300 tons of construction materials and heavy machinery parked on two adjacent closed lanes. Add in the additional weight of cars during rush hour, when traffic moved the slowest and the bridge was part of a bottleneck coming out of the city. That was the last straw: the gusset plates finally gave way, creating a near-instantaneous collapse.

linuxguy2 7 hours ago | parent | next

It's like the Swiss Cheese model where every system has "holes" or vulnerabilities, several layers, and a major incident only occurs when a hole aligns through all the layers.

https://en.wikipedia.org/wiki/Swiss_cheese_model
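
As a rough illustration of the model (a sketch, not something taken from the linked article): if each layer independently has some chance of having a hole in the wrong place, the chance that the holes line up through every layer is the product of those chances. The function name and the probabilities below are made up for the example, and real layers are rarely fully independent.

```python
from math import prod


def prob_holes_align(layer_hole_probs):
    """Probability that every layer fails at once, ASSUMING independent layers."""
    return prod(layer_hole_probs)


# Hypothetical numbers: three defensive layers, each failing 10% of the time.
three_layers = prob_holes_align([0.1, 0.1, 0.1])

# Same system, but one layer is effectively missing (it always fails).
one_layer_gone = prob_holes_align([0.1, 0.1, 1.0])

print(f"all three layers intact:  {three_layers:.4f}")
print(f"one layer always failing: {one_layer_gone:.4f}")
```

Under these assumed numbers, losing a single layer makes the combined failure ten times more likely, which is why each weakness can look harmless when examined on its own.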

Ringz 7 hours ago | root | parent

I use this model all the time. It's very helpful for explaining the multifactorial genesis of catastrophes to ordinary people.

anonymars 7 hours ago | root | parent

Also perhaps worth a read:

https://devblogs.microsoft.com/oldnewthing/20080416-00/?p=22...

"You’ve all experienced the Fundamental Failure-Mode Theorem: You’re investigating a problem and along the way you find some function that never worked. A cache has a bug that results in cache misses when there should be hits. A request for an object that should be there somehow always fails. And yet the system still worked in spite of these errors. Eventually you trace the problem to a recent change that exposed all of the other bugs. Those bugs were always there, but the system kept on working because there was enough redundancy that one component was able to compensate for the failure of another component. Sometimes this chain of errors and compensation continues for several cycles, until finally the last protective layer fails and the underlying errors are exposed."

jacquesm 5 hours ago | root | parent

I've had that multiple times. As well as the closely related 'that can't possibly have ever worked' and sure enough it never did. Forensics in old codebases with modern tools is always fun.

magicalhippo 3 hours ago | root | parent

> As well as the closely related 'that can't possibly have ever worked' and sure enough it never did.

I had one of those, customer is adamant latest version broke some function, I check related code and it hasn't been touched for 7 years, and as written couldn't possibly work. I try and indeed, doesn't work. Yet customer persisted.

Long story short, an unrelated bug in a different module caused the old, non-functioning code to do something entirely different if you had that other module open as well, and the user had discovered this and started relying on this emergent functionality.

I had made a change to that other module in the new release and in the process returned the first module to its non-functioning state.

The reason they interacted was of course some global variables. Good times...

jacquesm 3 hours ago | root | parent

Global variables... the original sin if you ask me. Forget that apple.

roenxi 6 hours ago | parent | next

> See, for instance, the space shuttle O-ring incident

That wasn't really a result of an alignment of small weaknesses though. One of the reasons that whole thing was of particular interest was Feynman's withering appendix to the report where he pointed out that the management team wasn't listening to the engineering assessments of the safety of the venture and were making judgement calls like claiming that a component that had failed in testing was safe.

If a situation is being managed by people who can't assess technical risk, the failures aren't the result of many small weaknesses aligning. It wasn't an alignment of small failures as much as that a component that was well understood to be a likely point of failure had probably failed. Driven by poor management.

> Fukushima

This one too. Wasn't the reactor hit by a wave that was outside design tolerance? My memory was that they were hit by an earthquake that was outside design spec, then a tsunami that was outside design spec. That isn't a number of small weaknesses coming together. If you hit something with forces outside design spec then it might break. Not much of a mystery there. From a similar perspective if you design something for a 1:500 year storm then 1/500th of them might easily fail every year to storms. No small alignment of circumstances needed.
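
A back-of-the-envelope version of that 1:500 arithmetic, as a sketch only. It assumes each year is independent with the same 1/500 chance of the design storm being exceeded, and that exceeding the design storm means failure; the function names are made up for illustration.

```python
def annual_exceedance_prob(return_period_years):
    """Chance per year that the design event is exceeded (1-in-N assumption)."""
    return 1.0 / return_period_years


def prob_exceeded_within(return_period_years, years):
    """Chance of at least one exceedance over a span of years (independent years assumed)."""
    p = annual_exceedance_prob(return_period_years)
    return 1.0 - (1.0 - p) ** years


def expected_exceedances_per_year(n_structures, return_period_years):
    """Average number of structures that see their design event exceeded each year."""
    return n_structures * annual_exceedance_prob(return_period_years)


# A single 1:500 structure over a hypothetical 50-year service life: roughly 0.095.
print(prob_exceeded_within(500, 50))

# A hypothetical fleet of 1000 such structures: about 2 exceedances per year on average.
print(expected_exceedances_per_year(1000, 500))
```

So even with nothing else going wrong, a structure built to a 1:500 standard has close to a one in ten chance of meeting an out-of-spec storm during a 50-year life under these assumptions.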

cpgxiii 3 hours ago | root | parent | next

In reality the "swiss cheese" holes for major accidents often turn out to be large holes that were thought to be small at the time.

> [Fukushima] No small alignment of circumstances needed.

The tsunami is what initiated the accident, but the consequences were so severe precisely because of decades of bad decisions, many of which would have been assumed to be minor decisions at the time they were made. E.g.

- The design earthquake and tsunami threat

- Not reassessing the design earthquake and tsunami threat in light of experience

- At a national level, not identifying that different plants were being built to different design tsunami threats (an otherwise similar plant avoided damage by virtue of its taller seawall)

- At a national level, having too much trust in nuclear power industry companies, and not reconsidering that confidence after a number of serious incidents

- Design locations of emergency equipment in the plant complex (e.g. putting pumps and generators needed for emergency cooling in areas that would flood)

- Not reassessing the locations and types of emergency equipment in the plant (i.e. identifying that a flood of the complex could disable emergency cooling systems)

- At a company and national level, not having emergency plans to provide backup power and cooling flow to a damaged power plant

- At a company and national level, not having a clear hierarchy of control and objective during serious emergencies (e.g. not making/being able to make the prompt decision to start emergency cooling with sea water)

Many or all of these failures were necessary in combination for the accident to become the disaster it was. Remove just a few of those failures and the accident is prevented entirely (e.g. a taller seawall is built or retrofitted) or greatly reduced (e.g. the plant is still rendered inoperable but without multiple meltdowns and with minimal radioactive release).

drob518 57 minutes ago | root | parent

I’m not sure why you think those are not a confluence of smaller events or that something outside the design spec isn’t one of those factors. By “small,” I don’t mean trivial. I mean an event that by itself wouldn’t necessarily result in disaster. Perhaps I should have said “smaller” rather than “small.” With the O-rings, the cold and the pressure to launch on that particular day all created the confluence. With Fukushima, the earthquake knocked out main power for primary cooling. That would have been manageable except then the backup generators got destroyed by the tsunami. It was not a case of just a big earthquake, whether outside or inside the design spec, making the reactor building fall down and then radiation being released.

amelius 7 hours ago | parent

It usually starts with a broken coffee machine.

drob518 45 minutes ago | root | parent

When that happens, get ready.
