Sunday 21 July 2024

Opinion: The issues the CrowdStrike Incident brings to mind

We are not all perfect

In my 36th year of a career in Information and Communications Technology, more than two-thirds of it spent as a freelance consultant, I have been quite fortunate, even in Nigeria, where some thought it folly to abandon a salaried role for the uncharted waters of treading the streets of Lagos and beyond for engagements.

I would be the first to say that every seeming subject matter expert has a history, in keeping with the credo that every saint has a past and every sinner has a future. My recognition of some of the stupidest things to do in any IT environment allows me to reflect on the imperfections, errors, ignorance, overconfidence, mistakes, and failings of experts. Critically, we should learn lessons more and excoriate ourselves less.

CrowdStrike’s bird strike

However, on Friday I was hoping so much that my weekend would not be ruined by an incident that we would need to remediate, especially when what caused it was totally out of our control. I refer to the 2024 CrowdStrike incident, where a US cybersecurity company rolled out a faulty update to its software, which affected an estimated 8.5 million Microsoft Windows devices.

In the process, we saw this affecting airlines, airports, hospitals, banks, hotels, payment systems, government services, enterprise systems, and emergency services, among the critical services the CrowdStrike software is widely deployed to protect.

The more technical people have already analysed the issue and several solutions have been proffered including an emergency recovery tool that Microsoft released late yesterday to help fix devices that have been left in an unusable state. [Microsoft: New Recovery Tool]

How ever did this happen?

I am shocked that this CrowdStrike update happened. In my view, it calls for a reassessment of how we measure impact, risk, and consequence when what we do can be so far-reaching and the means to back out or roll back a presumed solution requires extraneous measures.

If anything, and I have been involved in major deployments that could reach up to 250,000 users globally, you do not roll out a major update on a Friday and I never do on a Monday either. You need the presence of mind and personnel active during the week if things go wrong.

Before that, all techies are left asking, how did such a fundamentally flawed update make it out of the gates at CrowdStrike without being caught in testing, review, change management, risk management, impact assessment, and just the basic corporate desire never to roll out a problem regardless of the situation?

Preoccupied with the stock market

I got one interesting insight looking through the Twitter (X) feed of the President & CEO of CrowdStrike. His last tweet before pandemonium broke loose, posted on the 8th of July, quoted another, and it was the ululation about CrowdStrike being the seventh best-performing stock in the Nasdaq 100 year-to-date and the 14th best in the S&P 500. In both indices, it is the highest returning software stock of the first half of 2024, up over 50%.

CrowdStrike seemed to be in a celebratory mode, and last month they celebrated 5 years of being Nasdaq-listed. I hate to think that they had taken their eyes off the ball and that, by some careless misadventure, a company that was supposed to prevent cyberattacks presided over one of the largest outages ever in the history of information technology.

It leads me to think of the nursery rhyme, Sing a Song of Sixpence, where the king had a counting house specifically to count his money, and the maid suffered a mishap with a blackbird, which would have been one of the four-and-twenty baked in the pie that escaped when the pie was opened and the birds began to sing. Things were not particularly right in that kingdom.

Falconry gone to ground

That this outage affected airports and airlines globally is quite interesting because the update was to CrowdStrike’s Falcon Sensor product. This is an endpoint security agent, and the faulty update rendered devices totally inoperable. Certain airports deploy falcons to scare away birds that might interfere with the take-off and landing of aircraft. They prevent catastrophic bird strikes that could incapacitate aircraft and lead to accidents.

I find myself thinking CrowdStrike had become a bird strike of unimaginable consequence; the cost of the outages is yet to be computed, as many devices might still be offline. CrowdStrike’s stock price fell almost 20% in the 5 days to the Friday close of the market. The king of CrowdStrike, counting his money just over a week before, has just taken a personal hit of $43 million, and it might be up to $300 million according to Forbes. Not much for a billionaire though.

Ticking boxes and flipping heck

Back to the fundamentals, the question about testing remains, and I am left wondering which product, service, or project manager needed to tick boxes to meet a deadline rather than roll out a patch later than planned. Siding with the techies rather than management, could management have been given different advice, only for the techie to be overruled for the sake of expediency?

I have had these conversations too many times with project managers who have promised the world to management long before they have engaged the input of resources and facilities to get things done to the standard they have promised. The resource is then put in a bind to meet unrealistic deadlines.

You need a force of personality to push back and assert that your job is to deploy solutions that work the first time, maybe with a few tweaks, but you would never roll out what you can determine with all clarity still has issues and can constitute a problem. I do not want to screw up anyone’s project, but I have a professional responsibility to those I provide service and support to: not to leave their corporate devices in any worse state than before my solution was deployed.

Push back and regulation too?

Better late than sorry is not a sin; it is understanding the impact and risk of what you do. One last thing, the update should have gone out in controlled tranches, not globally in one fell swoop. I can see a situation where legislation might require those who can impact critical services to submit a full assessment and deployment plan to a regulator before deployment.
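For the techies, the idea of controlled tranches can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not how CrowdStrike or any other vendor deploys: the function names, the tranche fractions, and the deploy and is_healthy callbacks are all hypothetical placeholders for whatever your own tooling provides.

```python
from typing import Callable, Sequence


def plan_tranches(devices: Sequence[str],
                  fractions: Sequence[float]) -> list[list[str]]:
    """Split devices into tranches; each fraction is the cumulative
    share of the fleet covered once that tranche has been updated."""
    tranches = []
    start = 0
    for fraction in fractions:
        # Always take at least one new device, never more than the fleet.
        end = min(len(devices), max(start + 1, round(len(devices) * fraction)))
        tranches.append(list(devices[start:end]))
        start = end
    if start < len(devices):
        # Whatever the fractions did not cover goes in a final tranche.
        tranches.append(list(devices[start:]))
    return [t for t in tranches if t]


def staged_rollout(tranches: Sequence[Sequence[str]],
                   deploy: Callable[[str], None],
                   is_healthy: Callable[[str], bool],
                   max_failure_rate: float = 0.0) -> tuple[bool, list[str]]:
    """Deploy tranche by tranche and halt as soon as one tranche
    shows more unhealthy devices than the allowed failure rate."""
    updated = []
    for tranche in tranches:
        for device in tranche:
            deploy(device)  # hypothetical: push the update to one device
            updated.append(device)
        failures = sum(1 for d in tranche if not is_healthy(d))
        if failures / len(tranche) > max_failure_rate:
            return False, updated  # stop here; later tranches are untouched
    return True, updated


# Example: a fleet of 100 devices updated at 1%, 10%, 50%, then everyone.
fleet = [f"host-{i}" for i in range(100)]
tranches = plan_tranches(fleet, [0.01, 0.10, 0.50, 1.0])

# Simulate a bad update: every device that receives it becomes unhealthy.
ok, touched = staged_rollout(tranches, deploy=lambda d: None,
                             is_healthy=lambda d: False)
```

In this simulation, the faulty update stops at the first tranche, so one device is affected rather than the whole fleet. A real rollout would also need soak time between tranches and a tested way to roll back.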

We might need a Federal Aviation Administration (FAA), Securities & Exchange Commission (SEC), or Food and Drug Administration (FDA) type agency at national, regional, and international levels to superintend services that can affect global infrastructure, with teeth to regulate, sanction, or punish those who handle their impactful responsibilities with levity.
