In the video linked below, Dave Farley talks about the "meta-level" of what we as an industry should learn from the recent global outage caused by CrowdStrike's Falcon security product.
Apart from the general danger that comes with running software at the kernel level, we simply need to assume that sooner or later something will go wrong.
That may be because of a programming error. As the developer, I need to try really hard to avoid that. But it is the relatively easy part.
When we take a broader perspective, this is not only about proper software engineering. In my view, we are talking about business continuity management (BCM), and that comprises much more:
- The software that I am responsible for
- The rest of the computer system (hardware and software)
- Environmental conditions like possible power outages
- Recovery/roll-back mechanisms on the technical level
- Worst case: Alternative processes if all else fails
There is certainly more to say, but that is a good starting point.
My "mantra": Be paranoid, assume failure, and have multiple levels of resilience (provided the consequences of failure warrant all this).
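To make "multiple levels of resilience" a bit more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the helper `first_success` and the three levels (`primary_service`, `restore_from_rollback`, `manual_process`) are hypothetical names, not taken from any real product or from the video. The idea is simply to try each level in order and fall through to the next one when a level fails.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_success(levels: Sequence[tuple[str, Callable[[], T]]]) -> tuple[str, T]:
    """Try each resilience level in order; return the name and result of the first that works."""
    errors = []
    for name, action in levels:
        try:
            return name, action()
        except Exception as exc:  # assume failure: any level may break
            errors.append(f"{name}: {exc}")
    # Every level failed; report all collected errors at once.
    raise RuntimeError("all resilience levels failed: " + "; ".join(errors))


# Hypothetical levels, named only for illustration.
def primary_service() -> str:
    raise ConnectionError("primary system is down")


def restore_from_rollback() -> str:
    raise TimeoutError("roll-back did not complete")


def manual_process() -> str:
    return "order recorded on paper form"


used_level, outcome = first_success([
    ("primary", primary_service),
    ("roll-back", restore_from_rollback),
    ("manual fallback", manual_process),
])
print(used_level, "->", outcome)
```

In this sketch the last level is deliberately a non-technical one, mirroring the "alternative processes if all else fails" item in the list above. How many levels are worth building still depends on the consequences of failure.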