As we are all aware but rarely consider, today’s global IT landscape is more interconnected than ever, with businesses and critical services relying on vast networks of cloud-based solutions to operate efficiently. The flip side is that any disruption can have catastrophic effects across multiple sectors, a fact impressed upon us by CrowdStrike’s recent faulty software update. The event was a loud wake-up call to double-check our software infrastructures and implement more robust cybersecurity measures.
How It Went Down: A Chain Reaction of Failures
On July 19, 2024, a routine update to CrowdStrike’s Falcon sensor software inadvertently triggered a global IT crisis. The update, designed to enhance security, caused widespread crashes on systems running Microsoft Windows. Disruption spread across sectors: airports experienced delays, financial institutions faced service interruptions, and media outlets struggled to broadcast live content.
Adding to the complexity, a separate disruption on Microsoft’s Azure cloud platform occurred at around the same time, compounding the problem. The incident highlighted the critical risk of single points of failure in our cloud infrastructure, where a minor fault can bring down everything built on top of it, like a house of cards.
What Does It Mean for Our Current Cloud Infrastructure?
The CrowdStrike incident underscores a broader issue within our current cloud infrastructure: whether software runs in kernel space or in user space determines how much damage a single fault can do.
In any operating system, kernel space is where the core of the system runs, managing hardware and system resources. Software operating in kernel space has full control over the system, so a failure there can crash the entire machine, as was evident with the “Blue Screen of Death” triggered by CrowdStrike’s update.
Traditionally, most software operates in user space, where it interacts with the system via controlled APIs provided by the operating system. Crashes in user space are typically contained, affecting only the program in question, not the entire system. However, certain applications, particularly security software, require access to kernel space to perform their functions effectively.
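The containment that user space provides can be seen in a small experiment. The sketch below is only an illustration, not a model of how any security product works: `CRASHING_PROGRAM` and `run_in_user_space` are hypothetical names, and the "application" is a one-line Python program that deliberately aborts. The parent process observes the failure through an exit status and carries on, which is exactly the safety net that code running in kernel space does not have.

```python
import subprocess
import sys

# Hypothetical stand-in for a buggy application: os.abort() terminates the
# process abruptly, much like an unhandled fault in native code.
CRASHING_PROGRAM = "import os; os.abort()"


def run_in_user_space(program: str) -> int:
    """Run a small program as a separate user-space process.

    Returns the child's exit status; non-zero means it ended abnormally.
    """
    result = subprocess.run(
        [sys.executable, "-c", program],
        stderr=subprocess.DEVNULL,  # hide any abort message from the child
    )
    return result.returncode


exit_status = run_in_user_space(CRASHING_PROGRAM)
print(f"child process ended abnormally with status {exit_status}")
print("parent process and operating system are still running")
```

The exact status value depends on the platform, so the example reports it rather than assuming a particular number.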
Historically, Microsoft attempted to limit third-party access to kernel space to enhance security, introducing features like PatchGuard to protect the kernel from unauthorized modifications. However, this effort faced pushback from security companies that argued their software needed kernel access to function effectively. The result was a compromise that allowed continued kernel access, which contributed to the vulnerabilities exposed by the CrowdStrike incident: a faulty update caused a fault in kernel space, crashing millions of systems globally. The situation is a catch-22: kernel access is necessary for security purposes, yet the dangers of granting it are now evident.
This incident also highlights the broader issue of software supply chain vulnerabilities. Companies like CrowdStrike and Microsoft have direct access to the systems of countless organizations worldwide. This access, while necessary for timely updates and security patches, is itself a vulnerability. A single faulty update, as demonstrated, can cause widespread disruptions, underlining the need for rigorous testing and risk assessment before deploying updates, particularly those affecting kernel space.
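One common way to limit the reach of a faulty update is a staged (or "canary") rollout: the update goes to a small slice of machines first, and only continues if that slice stays healthy. The following is a minimal sketch under stated assumptions, not a description of any vendor's actual release process. The function `staged_rollout`, its parameters, the stage sizes, and the simulated fleet are all hypothetical and chosen for illustration.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stages: Sequence[float] = (0.01, 0.10, 1.0),  # assumed wave sizes
    max_failure_rate: float = 0.02,               # assumed tolerance
) -> list[str]:
    """Update hosts in growing waves and halt if a wave looks unhealthy.

    Returns the hosts that actually received the update.
    """
    updated: list[str] = []
    for fraction in stages:
        # Cumulative number of hosts that should be updated after this wave.
        target = max(1, int(len(hosts) * fraction))
        wave = hosts[len(updated):target]
        if not wave:
            continue
        for host in wave:
            apply_update(host)
        updated.extend(wave)
        failures = sum(1 for host in wave if not is_healthy(host))
        if failures / len(wave) > max_failure_rate:
            # Halt: the remaining hosts never receive the faulty update.
            break
    return updated


# Simulated fleet in which the update crashes every host it touches.
fleet = [f"host-{i}" for i in range(1000)]
crashed: set[str] = set()


def apply_faulty_update(host: str) -> None:
    crashed.add(host)


def host_is_healthy(host: str) -> bool:
    return host not in crashed


reached = staged_rollout(fleet, apply_faulty_update, host_is_healthy)
print(f"{len(reached)} of {len(fleet)} hosts received the faulty update")
```

In this simulation the rollout stops after the first wave, so the fault reaches 10 machines instead of 1,000. The trade-off is speed: each wave needs time to report health before the next one begins, which matters when an update is itself an urgent security fix.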
Building Resilience in Cloud Infrastructure
In light of this incident, significant changes are needed to improve the resilience of cloud infrastructure. Companies must take proactive steps to ensure their IT systems can withstand similar incidents in the future. Here are some strategies that can be implemented:
- Routine Vulnerability Assessments: Regularly conduct thorough evaluations of the entire IT infrastructure, including third-party dependencies, to identify and mitigate potential risks before they escalate.
- Distributed System Design: Adopt system architectures that minimize reliance on single points of failure by spreading resources and functions across multiple nodes or regions.
- Strengthened Vendor Collaboration: Enhance communication and coordination with service providers to ensure rapid response and resolution in the event of an issue.
- Investment in Innovation: Focus on developing self-repairing systems that can detect and correct issues automatically, reducing the need for manual intervention.
- Proactive Risk Management: Implement strong governance frameworks that prioritize redundancy, resilience, and risk management in critical IT operations.
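To make the distributed design point above concrete, here is a minimal sketch of client-side failover: a request is tried against several replicas in order, and the first healthy one answers. Everything named here (`call_with_failover`, `AllReplicasFailed`, and the two region functions) is hypothetical, and a production system would add timeouts, health checks, and backoff on top of this.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised when every replica has been tried and none succeeded."""


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Try each replica in order and return the first successful result."""
    errors: list[Exception] = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # a failed replica must not stop the request
            errors.append(exc)
    raise AllReplicasFailed(errors)


# Hypothetical replicas: region A is down, region B is healthy.
def region_a() -> str:
    raise ConnectionError("region-a is unavailable")


def region_b() -> str:
    return "response from region-b"


result = call_with_failover([region_a, region_b])
print(result)
```

Because no single replica is required for the request to succeed, the outage in one region degrades the service instead of stopping it.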
How Axelliant Can Help
At Axelliant, we understand the critical importance of resilient IT infrastructure. As a partner of both CrowdStrike and Microsoft, we are uniquely positioned to offer solutions that address the vulnerabilities highlighted by this incident. Our expertise in cloud solutions and cybersecurity allows us to provide our clients with the tools and strategies they need to build robust, resilient IT systems.
We offer comprehensive IT audits and risk assessments to identify potential vulnerabilities in your infrastructure. Additionally, our services include the design and implementation of distributed systems that reduce the risk of single points of failure. Through enhanced communication protocols and cutting-edge R&D, we help our clients stay ahead of potential issues, ensuring that their operations remain uninterrupted, even in the face of global IT disruptions.
The CrowdStrike outage is a reminder of the ever-present risks in today’s digital landscape. At Axelliant, we are committed to helping our clients navigate these challenges with confidence, providing the expertise and solutions needed to ensure their IT systems are secure, resilient, and ready for whatever the future may hold.