Bug Causes Global CrowdStrike Outage Chaos, Says Company
CrowdStrike has blamed a bug in its test software for the catastrophic global IT outage that affected 8.5 million Windows systems
Krishna Murthy July 24, 2024
CrowdStrike Holdings Inc has attributed the global IT outage last week to a bug in its test software. The outage, on July 19, 2024, affected 8.5 million Windows systems, mostly used by airports, hospitals and large tech firms.
The cybersecurity company’s latest revelation came from a Preliminary Post Incident Review (PIR) included in its updated remediation guide, explaining the series of events that led to the disruption.
The PIR update on Wednesday mentioned that the core of the issue was in the “Sensor Content” shipped with CrowdStrike’s Falcon Sensor, which defines its capabilities and is updated via “Rapid Response Content” to address new threats. This software relies on “Template Types” and “Template Instances” to map specific behaviors for the sensor software to detect or prevent threats.
CrowdStrike Outage Investigation in Detail
CrowdStrike said that in February 2024 it introduced a new “InterProcessCommunication (IPC) Template Type” designed to detect “novel attack techniques that abuse Named Pipes”.
Following successful testing on March 5, 2024, the first IPC Template Instance was released to production, with further instances following through July.
“Subsequently, three additional IPC Template Instances were deployed between April 8, 2024 and April 24, 2024. These Template Instances performed as expected in production,” the PIR said.
The July 19 release, however, contained “problematic content data” that a bug in the Content Validator allowed to pass its checks. When the sensor loaded that content, it performed an out-of-bounds memory read that triggered system crashes.
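The failure mode described in the PIR, content that passes validation but later sends a reader past the end of its data, can be illustrated with a minimal sketch. This is not CrowdStrike's code: `TemplateType`, `validate_instance` and `read_field` are hypothetical names, and the idea that a template declares a fixed number of fields is an assumption made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemplateType:
    """Hypothetical schema: how many fields every instance must supply."""
    name: str
    field_count: int


def validate_instance(template: TemplateType, values: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the instance passes."""
    problems = []
    if len(values) != template.field_count:
        problems.append(
            f"{template.name}: expected {template.field_count} fields, "
            f"got {len(values)}"
        )
    return problems


def read_field(values: list[str], index: int) -> str:
    """Bounds-checked read: reject the index rather than read past the data."""
    if not 0 <= index < len(values):
        raise IndexError(f"field index {index} is outside the supplied data")
    return values[index]
```

In a memory-safe language such as Python an out-of-range access raises an error; in native code running inside an operating system, the same mistake can read unrelated memory and crash the machine, which is why both the validator check and the read-time check matter.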
CrowdStrike’s PIR revealed that the assumption of the July 19 release’s stability, based on prior successful tests, was flawed. The unanticipated exception caused widespread Windows OS crashes, affecting critical operations globally, from airlines and banks to stock exchanges.
The incident report includes promises to test future Rapid Response Content more rigorously, stagger releases, offer users more control over when to deploy it, and provide release notes.
The fiasco prompted an apology from Shawn Henry, CrowdStrike’s Chief Security Officer, who acknowledged the failure in a LinkedIn post, stating, “The confidence we built in drips over the years was lost in buckets within hours, and it was a gut punch.”
What’s Next for CrowdStrike?
The US-based cybersecurity company is now undertaking a comprehensive review to understand the full extent of the incident, which brought down operations across various sectors, including emergency services and banking, particularly in Hong Kong and the UK. Microsoft and CrowdStrike have since rolled out fixes, restoring many systems.
CrowdStrike regularly makes security content configuration updates to observe, detect, or prevent malicious activity. The problematic update, however, carried an undetected error, leading to the crashes. The company has pledged to enhance its testing protocols for future Rapid Response Content. This includes adding a new check to the faulty Content Validator and adopting staggered rollouts, known as canary deployments, so that updates reach a small group of systems before widespread release.
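A canary deployment of the kind described above can be sketched as follows. The function name `staged_rollout`, the stage sizes and the failure threshold are all illustrative assumptions, not details from CrowdStrike's report; `deploy` stands in for whatever mechanism pushes an update to one host and reports whether the host stayed healthy.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.10, 1.0),
    max_failure_rate: float = 0.0,
) -> tuple[list[str], bool]:
    """Deploy to growing slices of hosts, halting if a stage fails too often.

    Returns the hosts that were attempted and whether the rollout completed.
    """
    attempted: list[str] = []
    for fraction in stage_fractions:
        # Cumulative number of hosts that should be covered after this stage.
        target = max(1, int(len(hosts) * fraction))
        batch = hosts[len(attempted):target]
        if not batch:
            continue
        failures = sum(1 for host in batch if not deploy(host))
        attempted.extend(batch)
        if failures / len(batch) > max_failure_rate:
            # Halt: the remaining hosts never receive the update.
            return attempted, False
    return attempted, True
```

With 100 hosts and the default stages, an update that crashes every machine would be stopped after a single host instead of reaching the whole fleet.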
Additionally, CrowdStrike plans to give customers more control over content delivery, allowing them to choose when and where updates are deployed. This move aims to prevent similar incidents and rebuild customer trust.
The fallout from the outage was significant, with CrowdStrike’s shares plummeting nearly 30%, erasing billions from its market value. The US House Committee on Homeland Security has requested an appearance from CEO George Kurtz to explain the measures the company will take to mitigate future risks.
Henry reiterated CrowdStrike’s commitment to learning from this incident and improving its processes to ensure such a failure does not recur. The company is determined to restore its reputation and customer confidence by addressing the root causes and implementing robust safeguards.
The CrowdStrike incident underscores the critical importance of rigorous testing and validation in cybersecurity software. The global disruption caused by a single faulty update shows how far the impact of one such defect can spread. Moving forward, CrowdStrike’s efforts to improve its testing and deployment processes will be crucial in preventing similar incidents and maintaining the integrity of its cybersecurity solutions.