Note: these are purely my opinions, not those of CyAN or any of my clients.
Last week, security vendor CrowdStrike deployed an update to its Falcon EDR (Endpoint Detection and Response) software agent on millions of systems running Microsoft Windows. Due to a faulty system file, affected hosts suffered “blue screen of death” crashes, with major impact across multiple critical industries: Microsoft Azure went down, multiple airlines suspended flights, and hospital operations slowed, while IT operations teams scrambled to undo the update, often manually, across thousands of machines. Microsoft has since issued a repair tool.
Fellow CyAN member and governing board member Kim Chandler McDonald recently posted an article on her LinkedIn feed about the dangers of software monoculture, drawing parallels to the risks of monocultures in nature. Kim also shares recommendations for good practices to avoid monoculture risks in the first place – as with all of her pieces, it’s worth a read.
The monoculture issue is a well-known problem across many types of software. For example, ever since Microsoft Windows established dominance across the global desktop and much of the server market, often-commoditized cyberattacks against it have caused significant and widespread problems1.
To its credit, Microsoft has been a demonstrably “good citizen” since introducing Patch Tuesday in the early 2000s: it has become increasingly communicative about vulnerabilities and bug fixes, and it is an active and helpful part of the global anti-cybercrime ecosystem, which has significantly reduced the risk of another Code Red worm. That said, breaches of widely used software platforms such as SolarWinds, as well as critical vulnerabilities in omnipresent components such as Log4j, underscore that monocultures create massive systemic vulnerability.
As much as I agree with Kim’s article, I believe that what we know so far (!) about the CrowdStrike issue merits a few additional discussions.
First is that of vendor software quality assurance. Now comes the hindsight armchair quarterbacking. Ready?
Here’s an analysis of what supposedly caused the bug in the Falcon update. Here’s another one casting doubt on that explanation (apologies for the Twitter links) and speculating about why CrowdStrike’s QA didn’t catch the problem. There’s a post asking why kernel drivers are written in C/C++, which is prone to memory safety issues (it references the first post – I make no comment on the accuracy of any of these). Another comment (in my opinion rightfully) states that no EDR software should rely on kernel-mode drivers. I’ve also read claims in numerous private groups I’m a member of that CrowdStrike did not perform code fuzzing, or at least not enough of it, and that it bypassed Microsoft driver signing – which, admittedly, Microsoft does not require outside of machines using Secure Boot.
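For readers unfamiliar with the term, fuzzing simply means hammering a parser with malformed and randomly mutated inputs and treating anything other than a clean, controlled rejection as a release-blocking defect. The following is a minimal, purely illustrative sketch – the “CHNL” file format and the parse_content_update function are invented for this example and have nothing to do with CrowdStrike’s actual code:

```python
import random

def parse_content_update(data: bytes) -> dict:
    """Hypothetical stand-in for a content/channel file parser.

    The point of fuzzing is to verify that a parser like this rejects
    malformed input gracefully instead of crashing.
    """
    if len(data) < 8 or data[:4] != b"CHNL":
        raise ValueError("invalid header")
    record_count = int.from_bytes(data[4:8], "little")
    # Refuse to read past the end of the buffer rather than trusting the header.
    if record_count * 16 > len(data) - 8:
        raise ValueError("record count exceeds file size")
    return {"records": record_count}

def fuzz(iterations: int = 10_000) -> None:
    """Feed mutated and truncated inputs to the parser; anything other
    than a clean ValueError counts as a bug worth blocking a release over."""
    seed = b"CHNL" + (2).to_bytes(4, "little") + b"\x00" * 32
    for i in range(iterations):
        data = bytearray(seed)
        # Randomly corrupt a few bytes, and sometimes truncate the file.
        for _ in range(random.randint(1, 8)):
            data[random.randrange(len(data))] = random.randrange(256)
        if random.random() < 0.2:
            data = data[: random.randrange(len(data))]
        try:
            parse_content_update(bytes(data))
        except ValueError:
            pass  # graceful rejection is the desired outcome
        except Exception as exc:  # unexpected crash: exactly what fuzzing should surface
            print(f"iteration {i}: parser crashed with {exc!r} on {bytes(data)!r}")
            raise

if __name__ == "__main__":
    fuzz()
```

In practice vendors use dedicated fuzzing frameworks and run them continuously in CI, but the principle is the same: a content parser that can be crashed by garbage input should never ship.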
Software will always have bugs and security vulnerabilities, which is one reason the debate about vendor liability has been ongoing since at least the late 1990s. CrowdStrike is not the first major security vendor to cause outages with updates – among others, Kaspersky and McAfee both caused problems with updates in 2010 (bizarrely, when CrowdStrike’s current CEO was CTO at McAfee). Software vendor contracts are usually very clear about disclaiming any liability for potential bugs.
Like many other tech vendors, CrowdStrike laid off a significant number of staff in 2023, including many engineers, developers, and QA testers. On its own, this is already concerning. I’ve always supported stricter regulation of supply chain risk management and of vendors’ responsibility for sound processes and adequate investment in risk mitigation.
Even perfect testing and QA procedures, backed by sufficient resources, could not have guaranteed that no defective update ever went out. But a defective update did go out, and it did take out millions of important systems. Thus, something went very, very wrong. QED. No argument about technical, procedural, or legal fine points will change this.
Second is the equally important topic of operational risk management. Without wishing to blame the victims, this is partially the fault of CrowdStrike’s customers, and it is strongly related to the software monoculture issue. Just as it’s impossible to guarantee the security of a system, it’s also very difficult to maintain a complete overview of all IT assets, versions, dependencies, and so on in complex environments.
Nonetheless, the sheer level of chaos caused across multiple industries by a single failing component, and the difficulty of recovering from it (many systems are still offline and expected to remain so for some time), point to a systemic failure of operational risk awareness and incident response playbooks at the firms affected by the bug.
Regulations such as the EU’s NIS2 Directive, and good practice standards, have focused primarily on third-party information security risk management. More advanced organizations have successfully quantified infosec risk and mapped it to other, more fundamental risk management metrics. The CrowdStrike outage demonstrated conclusively that the same cannot be said for a lot of operational IT risk – basically: what are my dependencies, how bad would it be if one of them failed, and how can I mitigate or at least work around that failure?
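To make those three questions concrete, here is a minimal sketch of the kind of dependency inventory exercise I mean. Everything in it – the asset names, vendors, criticality scores, and workarounds – is hypothetical; in a real organization this data would come from a CMDB or asset register rather than being hard-coded:

```python
from dataclasses import dataclass
from collections import defaultdict
from typing import Optional

@dataclass
class Dependency:
    name: str                   # e.g. an EDR agent, a hypervisor, a scheduling system
    vendor: str
    criticality: int            # 1 (inconvenient) to 5 (business stops)
    workaround: Optional[str]   # documented fallback, or None if there isn't one

# Hypothetical inventory for illustration only.
inventory = [
    Dependency("endpoint protection agent", "VendorA", 5, None),
    Dependency("check-in kiosk OS image", "VendorB", 4, "manual check-in at desk"),
    Dependency("clinical scheduling system", "VendorC", 5, "paper schedule printouts"),
    Dependency("office productivity suite", "VendorB", 2, "local copies / email"),
]

# Questions 1 and 2: which single vendors carry the most concentrated, high-impact risk?
exposure = defaultdict(int)
for dep in inventory:
    exposure[dep.vendor] += dep.criticality
for vendor, score in sorted(exposure.items(), key=lambda kv: -kv[1]):
    print(f"{vendor}: aggregate criticality {score}")

# Question 3: flag critical dependencies with no documented workaround at all.
for dep in inventory:
    if dep.criticality >= 4 and dep.workaround is None:
        print(f"NO FALLBACK: {dep.name} ({dep.vendor}) - needs a continuity plan")
```

Even an exercise this crude makes the concentration risk visible: one vendor or one component accumulating a high criticality score, with no fallback documented, is exactly the situation so many organizations found themselves in last week.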
Even if you can’t guard against a buggy update of a vital system component, you should at least know what those components are, how valuable they are, and how to work around their failure. A colleague recounted an example of a client that, having been burned by two recent breaches, had implemented backup processes using pen and paper to ensure continuity of operations in case of a major IT outage. This was neither optimal nor efficient, but it worked.
I’m a pessimist, and I like placing the blame for failures on leadership when there’s even a hint that negligence or underinvestment in the pursuit of higher profits was a weighty contributing factor to a crisis. That’s part of why CEOs are paid a lot of money. While the sheer size of the problem screams that someone screwed up, there’s always the possibility that this was not the case. Nonetheless, I firmly believe that the coming months will see a torrent of lawsuits, legislative investigations, and stricter regulation around all of the above issues: software monoculture, resilience testing and QA, and operational risk management. I am convinced that these will cause major problems not only for CrowdStrike, but also for many of its clients in critical sectors whose readiness was found to be lacking.
The worst-case scenario is that C-levels will draw the wrong conclusions and somehow blame their information security teams. This was not a security issue.
- There’s an argument to be made here about the prevalence of Linux, which runs a sizeable portion of critical infrastructure. The breadth of Linux flavors and vendors, and its use in many of the world’s most important IT systems – often hidden deep in the backend not only from “average users” but even from many C*Os – make it almost impossible to get good statistics on Linux usage outside of web servers and a few large, mostly public-sector desktop deployments. This diversity, on top of Linux’s strong, security-conscious architecture and the greater exposure of open source software to scrutiny by bug hunters, makes it far more resilient and resistant to major security threats. Most of the time. ↩︎