A small update brought down millions of IT systems around the world
Nathan (Australia): This weekend's global IT outage caused by a software update gone wrong highlights the interconnected and often fragile nature of modern IT infrastructure. It demonstrates how a single point of failure can have far-reaching consequences.
The outage was linked to a single update automatically rolled out to CrowdStrike Falcon, a ubiquitous cyber security tool used primarily by large organisations. This caused Microsoft Windows computers around the world to crash.
CrowdStrike has since fixed the problem on their end. While many organisations have been able to resume work now, it will take some time for IT teams to fully repair all the affected systems – some of that work has to be done manually.
How did this happen?
Many organisations rely on the same cloud providers and cyber security solutions. The result is a form of digital monoculture.
How was Microsoft involved?
When Windows computers everywhere started to crash with a “blue screen of death” message, early reports stated the IT outage was caused by Microsoft. In fact, Microsoft confirmed it experienced a cloud services outage in the Central United States region, which began around 6pm Eastern Time on Thursday, July 18 2024.
What do we learn from this episode?
Don't put all your IT eggs in one basket. Companies should use a multi-cloud strategy: distributing their IT infrastructure across multiple cloud service providers. This way, if one provider goes down, the others can continue to support critical operations.
Companies can also keep their business operating by building redundancies into their IT systems. If one component goes down, others can step up.
This includes having backup servers, alternative data centres, and “failover” mechanisms that can quickly switch to backup systems in the event of an outage. Automating routine IT processes can reduce the risk of human error, which is a common cause of outages. (AP)
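For readers who want to see what a “failover” mechanism looks like in practice, here is a minimal sketch in Python. It is an illustration only, not how any particular vendor does it: the function name call_with_failover and the two stand-in services, primary and backup, are hypothetical. The idea is simply to try each provider in turn and use the first one that answers.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(providers: Sequence[Callable[[], T]]) -> T:
    """Try each provider in order and return the first successful result."""
    errors = []
    for provider in providers:
        try:
            return provider()
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    # Every provider failed, so report the outage instead of hiding it.
    raise RuntimeError(f"all {len(providers)} providers failed: {errors}")


# Hypothetical stand-ins for a primary service and its backup.
def primary() -> str:
    raise ConnectionError("primary data centre unreachable")


def backup() -> str:
    return "served by backup"


result = call_with_failover([primary, backup])
print(result)
```

In this sketch the primary service is down, so the request is quietly answered by the backup. If every provider fails, the function raises an error rather than pretending all is well, which is the case that still needs a human response plan.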